[MB-12224] Active vBuckets on one server drops to zero when increasing replica count from 1 to 2 Created: 22/Sep/14  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Anil Kumar Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Scenario - increasing the replica count from 1 to 2

1. 5 node cluster, single bucket 'default'
2. On bucket edit settings change the settings for replica count from 1 to 2 and Save
3. Rebalance operation
4. While the rebalance is running, the Monitoring Stats - vBucket Resources view shows the active vBucket count dropping below 1024 (see attached screenshots)

Expectation - increasing the number of copies of the data should not affect the active vBuckets.

What we are seeing instead is that some of the active vBuckets were not available during the rebalance.

 Comments   
Comment by Aleksey Kondratenko [ 22/Sep/14 ]
This is likely simply due to inconsistency of stats between nodes.

We need logs to diagnose this. Giving us access to machines is not as useful because we don't plan to deal with this bug soon.




[MB-10789] Bloom Filter based optimization to reduce the I/O overhead Created: 07/Apr/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
A bloom filter can be considered an optimization to reduce disk I/O overhead. Basically, we maintain a separate bloom filter per vBucket database file, and rebuild the bloom filter (e.g., increasing the filter size to reduce the false-positive error rate) as part of vBucket database compaction.

As we know the number of items in a vBucket database file, we can determine the number of hash functions and the size of the bloom filter needed to achieve the desired false-positive error rate. Note that Murmur hash is widely used in Hadoop and Cassandra because it is much faster than MD5 and Jenkins. It is well known that fewer than 10 bits per element are required for a 1% false-positive probability, independent of the number of elements in the set.

We expect that having a bloom filter will enhance both XDCR and full-ejection cache management performance at the expense of the filter's memory overhead.
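As a rough illustration of the sizing math above (standard Bloom filter formulas, not the ep-engine implementation; the function name and numbers are illustrative):

import math

def bloom_filter_params(num_items, fp_rate):
    """Return (total_bits, num_hashes) for a standard Bloom filter."""
    # m = -n * ln(p) / (ln 2)^2 : total number of bits in the filter
    m = int(math.ceil(-num_items * math.log(fp_rate) / (math.log(2) ** 2)))
    # k = (m / n) * ln 2 : optimal number of hash functions
    k = int(round((float(m) / num_items) * math.log(2)))
    return m, k

# e.g. 100,000 items in a vBucket file at a 1% false-positive target
bits, hashes = bloom_filter_params(100000, 0.01)
print("%.1f bits per element, %d hash functions" % (bits / 100000.0, hashes))

For 100,000 items at a 1% target this gives roughly 9.6 bits per element and 7 hash functions, consistent with the "fewer than 10 bits per element" figure above.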



 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
Design Document:
https://docs.google.com/document/d/13ryBkiLltJDry1WZV3UHttFhYkwwWsmyE1TJ_6tKddQ




[MB-12223] Test Automation Advancements for Sherlock Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Major
Reporter: Raju Suravarjjala Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Placeholder for all the test automation work.




[MB-12201] Hotfix Rollup Release Created: 16/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Cihan Biyikoglu Assignee: Raju Suravarjjala
Resolution: Unresolved Votes: 0
Labels: hotfix
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Represents the rollup hotfix for 2.5.1 that includes all hotfixes released to date (September 2014), excluding the V8 change.

 Comments   
Comment by Dipti Borkar [ 16/Sep/14 ]
Is this rollup still 2.5.1? That will create lots of confusion. Can we tag it 2.5.2, or does that lead to another round of testing? There are way too many hotfixes, so we really need a new point release.
Comment by Cihan Biyikoglu [ 17/Sep/14 ]
Hi Dipti, to improve hotfix management we are changing the way we do hotfixes. The rollup will bring more hotfixes together and ensure we provide customers all the fixes we know about. If we have already fixed an issue at the time you request your hotfix, there is no reason to risk exposing you to known and already-fixed issues in the version you are using. A side effect of this should also be an easier life for support.
-cihan




[MB-11060] Build and test 3.0 for 32-bit Windows Created: 06/May/14  Updated: 22/Sep/14  Due: 09/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Chris Hillery Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7/8 32-bit

Issue Links:
Dependency
Duplicate

 Description   
For the "Developer Edition" of Couchbase Server 3.0 on Windows 32-bit, we need to first ensure that we can build 32-bit-compatible binaries. It is not possible to build 3.0 on a 32-bit machine due to the MSVC 2013 requirement. Hence we need to configure MSVC as well as Erlang on a 64-bit machine to produce 32-bit compatible binaries.

 Comments   
Comment by Chris Hillery [ 06/May/14 ]
This is assigned to Trond who is already experimenting with this. He should:

 * test being able to start the server on a 32-bit Windows 7/8 VM

 * make whatever changes are necessary to the CMake configuration or other build scripts to produce this build on a 64-bit VM

 * thoroughly document the requirements for the build team to reproduce this build

Then he can assign this bug to Chris to carry out configuring our build jobs accordingly.
Comment by Trond Norbye [ 16/Jun/14 ]
Can you give me a 32-bit Windows installation I can test on? My MSDN license has expired and I don't have Windows media available (and the internal wiki page just has a limited set of licenses and no download links).

Then assign it back to me and I'll try it
Comment by Chris Hillery [ 16/Jun/14 ]
I think you can use 172.23.106.184 - it's a 32-bit Windows 2008 VM that we can't use for 3.0 builds anyway.
Comment by Trond Norbye [ 24/Jun/14 ]
I copied the full result of a build where I set target_platform=x86 on my 64-bit Windows server (the "install" directory) over to a 32-bit Windows machine and was able to start memcached, and it worked as expected.

Our installers do other magic, such as installing the service, that is needed in order to start the full server. Once we have such an installer I can do further testing.
Comment by Chris Hillery [ 24/Jun/14 ]
Bin - could you take a look at this (figuring out how to make InstallShield on a 64-bit machine create a 32-bit compatible installer)? I won't likely be able to get to it for at least a month, and I think you're the only person here who still has access to an InstallShield 2010 designer anyway.
Comment by Bin Cui [ 04/Sep/14 ]
PM should make the call on whether or not we want to have 32-bit support for Windows.
Comment by Anil Kumar [ 05/Sep/14 ]
Bin - as confirmed back in March-April when we finalized the supported platforms for Couchbase Server 3.0, we decided to continue building 32-bit Windows for development-only support, as mentioned on our documentation deprecation page: http://docs.couchbase.com/couchbase-manual-2.5/deprecated/#platforms.

Comment by Bin Cui [ 17/Sep/14 ]
1. Create a 64-bit builder with a 32-bit target.
2. Create a 32-bit builder.
3. Transfer the 64-bit staging image to the 32-bit builder.
4. Run the packaging steps and generate the final package out of the 32-bit builder.
Comment by Chris Hillery [ 18/Sep/14 ]
Bin - when we discussed this a few weeks ago I had thought you were going to be driving forward on the details of implementing this. There are a few steps here that we don't know how to do. I will work with Trond to figure out how to enable step #1, as it sounds like he has accomplished most of that locally. Steps 2 and 3 I think we (build team) can figure out.

Step 4, though, is what I was referring to in my comment on 24/Jun/14. You are the only person in the company, so far as I know, who has both understanding and access to InstallShield. I feel sure that this is going to require making changes to our project (or, worse, creating a new project) to create a 32-bit installer. I need you to figure out how to do that, or this task will not be completed.

Assigning this back to Bin for now, although I will work on figuring out how to enable steps 1-3 of his proposed workflow.
Comment by Wayne Siu [ 19/Sep/14 ]
Chris,
Can you give us an update on steps 1/2/3 by Monday (09.22)? Thanks.
Comment by Chris Hillery [ 22/Sep/14 ]
I have updated the Windows build script to accept an explicit architecture (x86 / amd64) and am currently re-packaging Trond's depot of third-party x86 dependencies for my cbdeps mechanism. If the build does not then succeed, I will try to work with Trond tonight on debugging it.

In the meantime, I could use some help with step 2 - either we need to just use the existing 2.5.1 x86 build slave, or else we'll want to clone it to a new VM that can still run InstallShield 2010. At that point we should be able to get Bin involved.




[MB-12219] HINTs for N1QL to suggest index selection and execution path Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Ability to specify hints for index selection and execution path.
Index selection scenario: multiple predicates in the WHERE clause and ORDER BY, where it isn't obvious from statistics which of many indexes to pick; the user gets to suggest one to N1QL.
Execution path scenarios: the type of join to apply, or optimizing for fast first results vs. fast total execution, etc.

Pointing at Sherlock, but we can live without this in v1.

 Comments   
Comment by Cihan Biyikoglu [ 22/Sep/14 ]
Feel free to push this out of the Sherlock release if it isn't being done in Sherlock.
-cihan




[MB-11642] Intra-replication falling far behind under moderate-heavy workload when XDCR is enabled Created: 03/Jul/14  Updated: 22/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket, DCP
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Perry Krug Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: 0h
Time Spent: 47h
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File ep_dcp_replica_items_remaining.png     PNG File ep_upr_replica_items_remaining.png     PNG File latency_observe.png     PNG File OPS_during_rebalance.png     PNG File Repl_items_remaining_after_rebalance.png     PNG File Repl_items_remaining_before_rebalance.png     PNG File Repl_items_remaining_during_rebalance.png     PNG File Repl_items_remaining_start_of_rebalance.png     PNG File Screen Shot 2014-07-15 at 11.47.19 AM.png     PNG File Screen Shot 2014-08-13 at 10.21.24 AM.png    
Issue Links:
Relates to
relates to MB-11643 Incoming workload suffers when XDCR e... Resolved
relates to MB-11640 DCP Prioritization Open
relates to MB-11984 Intra-cluster replication slows down ... Open
relates to MB-11675 20-30% performance degradation on app... Closed
Triage: Triaged
Is this a Regression?: Yes

 Description   
Running the "standard sales" demo that puts a 50/50 workload of about 80k ops/sec across 4 nodes of m1.xlarge, 1 bucket 1 replica.

The "intra-cluster replication" value grows into the many k's.

This is a value that our users look at rather closely to determine the "safety" of their replication status. A reasonable number on 2.x has always been below 1k, but I think we need to reproduce this and set appropriate baselines for ourselves with 3.0.

Assigning to Pavel as it falls into the performance area, and we would likely be best served if this behavior were reproduced and tracked.

 Comments   
Comment by Pavel Paulau [ 03/Jul/14 ]
Well, I underestimated your definition of moderate-heavy.)

I'm seeing a similar issue when the load is about 20-30K sets/sec. I will create a regular test and will provide all the information required for debugging.
Comment by Pavel Paulau [ 09/Jul/14 ]
Just wanted to double check, you can drain 10K documents/sec with both 2.5 and 3.0 builds, is that right?

UPDATE: actually 20K/sec because of replica.
Comment by Pavel Paulau [ 10/Jul/14 ]
In addition to replication queue (see attached screenshot) I measured replicateTo=1.

On average it looks better in 3.0 but there are quite frequent lags as well. Seems to be a regression.

Logs for build 3.0.0-943:
http://ci.sc.couchbase.com/job/ares/308/artifact/

My workload:
-- 4 nodes
-- 1 bucket
-- 40M x 1KB docs (non-DGM)
-- 70K mixed ops/sec (50% reads, 50% updates)
Comment by Perry Krug [ 13/Jul/14 ]
Pavel, I'm still seeing quite a few items sitting in the "intra-replication queue" and some spikes up into the low thousands. I'm using build 957.

The spikes seem possibly related to indexing activity and when I turn XDCR on, it gets _much_ worse.

Let me know if you need any logs from me or anything else I can do to help reproduce and diagnose.
Comment by Pavel Paulau [ 13/Jul/14 ]
Well, the initial issue description didn't mention anything about indexing or XDCR.

Do you see the problem during a KV-only workload? Also, logs are required at this point.
Comment by Perry Krug [ 15/Jul/14 ]
Hey Pavel, here is a set of logs from my test cluster running the same workload I described earlier with one design document (two views). This is just the stock beer-sample dataset with a random workload on top of it.

You'll also see that a few minutes after this cluster started up, I turned on XDCR. The "items remaining" in the intra-replication queue shot up to over 1M and has not gone down. It also appears that the UPR drain rates for XDCR, replication and views have nearly stopped completely, with very sporadic spikes (see the recently attached screenshot).

I'm raising this to a blocker since it seems quite significant that the addition of XDCR was able to completely stop the UPR drain rates and has so negatively impacted our HA replication within a cluster.

Logs are at:
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.196.75.236.zip
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.196.81.119.zip
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.198.10.83.zip
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.198.52.42.zip

This is on build 957
Comment by Perry Krug [ 15/Jul/14 ]
Just tested on build 966 and seeing similar behavior:
With the above workload (70k ops/sec, 50/50 gets/sets):
-K/V-only: the intra-replication queue is around 200-400, with an occasional spike up to 10k. I believe this is still unacceptable and something our customers will complain about.
-K/V+views: intra-replication queue baseline around 1-2k, with frequent spikes up to 4k-5k.
-K/V+XDCR (no views): the intra-replication queue immediately begins growing significantly when XDCR is added. The drain rate for intra-replication is sometimes half the drain rate for XDCR (MB-11640 seems clearly needed here). The intra-replication queue reaches about 300k (across the cluster) and then starts going down once the XDCR items have been drained. It then hovers just under 200k. Again, not so acceptable.
-K/V+XDCR+1DD/2Views: again, the intra-replication queue grows and does not seem to recover. It reaches over 1M, and then the bucket ran out of memory and everything seemed to shut down. I'll be filing a separate issue for that.
Comment by Pavel Paulau [ 19/Jul/14 ]
It's actually easily reproducible, even in KV cases.

It appears that infrequent sampling misses most spikes, but "manual" observation detects occasional bursts (up to 60-70K in my run).
Comment by Thomas Anderson [ 31/Jul/14 ]
A retest of the application with the latest 3.0.0-1069 build shows a minor regression compared with 2.5.1 for intra-replication.
Same 4-node server system; 2 views, KV documents, target 80K OPS at a 50/50 ratio; added a replicate node; rebalanced; performance comparable.
In 2.5.1: steady-state ep_upr_replica_items_remaining 200-400 with periodic 10K spikes.
In 3.0.0-1069: steady state before and after add node/rebalance, ep_upr_replica_items_remaining 200-500 with periodic 60K-75K spikes, no more frequent than in 2.5.1.
In 2.5.1: ~80K OPS (with 2 views); in 3.0.0-1069: ~77K OPS; both drop to 60K OPS during rebalance.
Comment by Thomas Anderson [ 31/Jul/14 ]
I believe 3.0.0-1069 has been shown to address most of the regression from 2.5.1 in the OPS and intra-replication queue depth originally reported.
Comment by Perry Krug [ 01/Aug/14 ]
Apologies Thomas, I'm not seeing much of a difference yet, and there are still too many situations where the intra-replication queue is shown as 70k during steady state and 500k with XDCR enabled.
Comment by Wayne Siu [ 02/Aug/14 ]
Thomas,
Should we have someone from Dev also take a look at this ticket? We need to decide by Monday, Aug 4, 2014 whether this should be resolved before Beta2.
Comment by Wayne Siu [ 04/Aug/14 ]
In today's beta blocker meeting (PM, Dev, and QE), we agreed that we should continue to work on this issue and PM has agreed to remove the beta blocker tag off this ticket.
Comment by Chiyoung Seo [ 04/Aug/14 ]
Mike,

I saw that the replication backlog size spiked during the rebalance. I suspect it is mainly caused by the fact that we include the vbucket takeover backlog size in the replication backlog size. If that's the case, I don't think it is correct. The vbucket takeover backlog size should not be included, but tracked separately. Can you confirm this?
Comment by Mike Wiederhold [ 05/Aug/14 ]
I just want to note here that this isn't actually a bug and is the currently expected behavior. The last time we discussed the stats with the support team we agreed that rebalance and replication stats would be merged together on the UI. The reason for the large spikes is that a vbucket move is taking place. If there are a lot of items, then the items remaining to be sent for that particular vbucket will cause a large spike on the UI, because we have a lot of items to replicate in order to move that vbucket. It is also not a trivial task to simply not count the vbuckets that are being moved as part of this stat, and I will need to think of a good way to do this.
Comment by Perry Krug [ 05/Aug/14 ]
Also note that this was originally filed with no rebalance going on. I am still able to reproduce the spikes under steady-state load, and adding XDCR causes the queue to grow uncontrollably up to 500k items, after which it takes a long time to recover.

I think I can see from the stats that the XDCR and views UPR streams are sometimes draining as fast as or faster than the intra-cluster replication streams.
Comment by Sundar Sridharan [ 07/Aug/14 ]
Just ran a pure OBSERVE latency test using the Java client: 1000 SETs with OBSERVE until replicated to one replica.
2.5.0 average latency is 396ms
3.0.0 average latency is 163ms
Which means 3.0 OBSERVE performs 2X better.

Also, on Perry's setup, with XDCR and views turned off, we see that intra-cluster replication catches up quickly and the replication queues stay close to 0. It is only when XDCR and views are enabled on Perry's setup that we see intra-cluster replication fall behind when there are a lot of incoming mutations.
Comment by Sundar Sridharan [ 07/Aug/14 ]
Perry, just a clarification on the stats as explained to me by Mike - we cannot read the items_remaining stat for intra-cluster replication the same way we read it for XDCR and views. The reason is that the stat ep_dcp_replica_items_remaining is computed differently from the stats ep_dcp_xdcr_items_remaining and ep_dcp_views_items_remaining. For intra-cluster replication, the consumer opens a DCP stream with the end sequence number set to infinity; however, for XDCR and views, the streams are created with explicit end sequence numbers. So the items remaining for the intra-cluster replication stream is computed as the number of items remaining in the queues, whereas for XDCR and views it is computed as end sequence number minus start sequence number. As a result, intra-cluster replication items remaining will always show higher values than XDCR and views, because the former is a continuous stream while the latter two are discrete.

In summary, it is quite possible that intra-cluster replication is actually much faster than XDCR or views, but the UI stats do not communicate this. Mike and I discussed with Alk whether we can show a more reliable value for the XDCR queues, but it looks like it is not a trivial change.

The real bug here I think would be the DCP drain rates staying at zero even though there are a large number of items remaining to be sent.
In my testing I have not hit this case yet.

Hope this explanation helps. thanks
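A rough illustration of the two computations with made-up numbers (illustrative only, not the actual ep-engine code):

# Intra-cluster replication: the stream never ends (end seqno = infinity),
# so "items remaining" is whatever is currently sitting in the stream's queue.
def replica_items_remaining(queued_items):
    return len(queued_items)

# XDCR / views: the stream is created with an explicit end seqno,
# so "items remaining" is simply end_seqno - current_seqno.
def bounded_stream_items_remaining(current_seqno, end_seqno):
    return max(0, end_seqno - current_seqno)

# A continuous stream can show a large backlog even while it is draining
# faster than a bounded stream that reports a small remainder.
print(replica_items_remaining(range(1200)))          # 1200 items queued
print(bounded_stream_items_remaining(9500, 10000))   # 500 items remaining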
Comment by Perry Krug [ 07/Aug/14 ]
Thank you Sundar, that does help very much.

However, we are still seeing many thousands to many hundreds-of-thousands of items sitting in the intra-replication queue and I think that's where the main problem lies...even if we don't compare it with the other queues.

Regarding the drain rates reaching 0, I have not seen that either on more recent builds, so it may no longer be an issue.
Comment by Perry Krug [ 08/Aug/14 ]
As an update from my side, I did some further testing with the same workload on nodes with 8-cores and 16-cores.

Under steady-state k/v with views configured, I don't see much improvement in the intra-replication queue size with 8 cores versus 4. 16 cores does show a bit of a benefit here, but it still is not quite as good as 2.5.1. (this is all default configuration of 3.0 build 1123)

When I added XDCR into the mix, 8 cores was noticeably better than 4 w.r.t the intra-replication queue size, but I still saw it stabilize around 80k items (still too high). 16 cores was even better and could keep the queue down around its steady-state levels, but not constantly 0.
Comment by Pavel Paulau [ 08/Aug/14 ]
+ my 2¢.

Build 3.0.0-1105 (aka beta-2 candidate)
9-nodes cluster, 40 vCPU, 10 GbE
50K inserts/sec
No views, no XDCR

I'm observing occasional 60-80K spikes in ep_dcp_replica_items_remaining.
Comment by Sundar Sridharan [ 11/Aug/14 ]
I have been trying repeatedly to reproduce the issue, but since doing a repo sync this morning, I am consistently seeing a much higher drain rate for intra-cluster replication compared with XDCR and views.
I don't know if this was the result of some other code merge, but I thought I would update this ticket in case anyone else has also seen the severity of this issue reduce with recent builds.
thanks
Comment by Perry Krug [ 11/Aug/14 ]
I just tested with build 1135 and still see similar behavior as previous builds.
Comment by Pavel Paulau [ 11/Aug/14 ]
The same characteristics in my setup.
Comment by Chiyoung Seo [ 12/Aug/14 ]
Perry, Pavel,

Sundar made the following change to batch the replication stream more efficiently and saw a better replication queue drain rate (up to 2x):

http://review.couchbase.org/#/c/40392/

Please run your test again to see how much it improves the replication performance.

We will continue to look at other optimizations that we can make.

Comment by Pavel Paulau [ 13/Aug/14 ]
I just saw >1M documents in the replication queue during data load at a rate of about 30K ops/sec (4 nodes). No views, no XDCR.
Build 1143.
Comment by Perry Krug [ 13/Aug/14 ]
I just attached a screenshot (https://www.couchbase.com/issues/secure/attachment/21560/Screen%20Shot%202014-08-13%20at%2010.21.24%20AM.png) that shows the effect of adding XDCR. You can see that when the drain rate of XDCR-DCP goes up to 16k, the replication-DCP drain rate goes from around 30k to 11k and the queue begins to grow. Not sure if this is helpful; just adding a bit of the specific symptom that I'm seeing.

This was also on build 1143 and is consistent with what I've seen in previous builds.
Comment by Sundar Sridharan [ 14/Aug/14 ]
Perry, Pavel, could you please try the toy build
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-MB11642-toy.rpm
and return the output of the
cbstats localhost:12000 dcpthreads
This is to get an idea of the threading behavior w.r.t DCP queues.
thanks
Comment by Perry Krug [ 15/Aug/14 ]
Hi Sundar, I'm still seeing basically the same behavior when enabling XDCR: the intra-replication DCP drain rate drops from about 35k to 12k and the XDCR DCP drain rate goes from 0 to 22k. The intra-replication queue grows up to about 700k and then very slowly drains whereas the XDCR queue goes up and comes down very quickly (I know the numbers are not directly comparable, but it is still indicative of the rates and amount of backlog).

Thanks again for your dedication to this,

The output of that command on all nodes is:
[perry@ip-10-196-91-114 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1113930048: 4701251
 replica_thread_1145399616: 2401500
 replica_total: 7102751
 views_total: 0
 xdcr_thread_1113930048: 227480
 xdcr_thread_1124419904: 228327
 xdcr_thread_1134909760: 1286037
 xdcr_thread_1145399616: 232247
 xdcr_total: 1974091

[perry@ip-10-198-29-175 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1114642752: 2392544
 replica_thread_1135622464: 2374468
 replica_thread_1146112320: 2406743
 replica_total: 7173755
 views_total: 0
 xdcr_thread_1114642752: 1390002
 xdcr_thread_1125132608: 273590
 xdcr_thread_1135622464: 268636
 xdcr_thread_1146112320: 269322
 xdcr_total: 2201550

[perry@ip-10-198-2-92 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1129421120: 4851498
 replica_thread_1139910976: 2417591
 replica_total: 7269089
 views_total: 0
 xdcr_thread_1108441408: 297918
 xdcr_thread_1118931264: 291598
 xdcr_thread_1129421120: 291698
 xdcr_thread_1139910976: 1448237
 xdcr_total: 2329451

[perry@ip-10-198-2-106 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1121360192: 2429881
 replica_thread_1131850048: 2439924
 replica_thread_1142339904: 2454502
 replica_total: 7324307
 views_total: 0
 xdcr_thread_1110870336: 308203
 xdcr_thread_1121360192: 308451
 xdcr_thread_1131850048: 1491118
 xdcr_thread_1142339904: 313941
 xdcr_total: 2421713

Keep in mind that the workload had been running for a few minutes longer than the XDCR when I took this snapshot; I didn't see a way of resetting the thread counts, so I'm not sure if that skewed the numbers. Does it appear that they are not evenly balanced?
Comment by Sundar Sridharan [ 15/Aug/14 ]
Thanks Perry, this looks interesting. When there are only 4 memcached worker threads, it appears that XDCR, having more connections, ends up getting higher priority and stealing bandwidth away from the replica threads.
If possible, could you restart the workload with XDCR enabled from the start, just to remove the skew and confirm the behavior? thanks in advance
Comment by Perry Krug [ 15/Aug/14 ]
Here you are Sundar:
[perry@ip-10-198-10-236 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1123735872: 275722
 replica_thread_1134225728: 267791
 replica_thread_1155205440: 273189
 replica_total: 816702
 views_total: 0
 xdcr_thread_1123735872: 264577
 xdcr_thread_1134225728: 264944
 xdcr_thread_1144715584: 614348
 xdcr_thread_1155205440: 274266
 xdcr_total: 1418135

[perry@ip-10-198-41-152 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1102215488: 273263
 replica_thread_1127586112: 269062
 replica_thread_1148565824: 271401
 replica_total: 813726
 views_total: 0
 xdcr_thread_1102215488: 262477
 xdcr_thread_1127586112: 262804
 xdcr_thread_1138075968: 608167
 xdcr_thread_1148565824: 272084
 xdcr_total: 1405532

[perry@ip-10-198-10-207 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1116973376: 275649
 replica_thread_1127463232: 274226
 replica_thread_1137953088: 270203
 replica_total: 820078
 views_total: 0
 xdcr_thread_1116973376: 334701
 xdcr_thread_1127463232: 199998
 xdcr_thread_1137953088: 616727
 xdcr_thread_1148442944: 277805
 xdcr_total: 1429231

[perry@ip-10-198-52-82 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1116117312: 275408
 replica_thread_1126607168: 263023
 replica_thread_1137097024: 266781
 replica_total: 805212
 views_total: 0
 xdcr_thread_1116117312: 264366
 xdcr_thread_1126607168: 263195
 xdcr_thread_1137097024: 607506
 xdcr_thread_1147586880: 271143
 xdcr_total: 1406210

It does look like XDCR is getting just a little less than 2x the time that replication is. Also, I have DCP traffic for views, but it doesn't look like it's being captured here; perhaps that's something you'll want to add.

FWIW, these stats seem like they could be quite useful in the future, perhaps making it so that the "cbstats reset" clears the counters would be helpful in the field if we need to troubleshoot something?
Comment by Sundar Sridharan [ 15/Aug/14 ]
Thanks Perry, will consider adding this stat for production and also implement the view stats.
Awaiting the results from Thomas before recommending a fix.
Comment by Sundar Sridharan [ 15/Aug/14 ]
Perry, Thomas found that on machines with just 4 cores, reducing "XDCR Max Replications per Bucket" from 16 to 4 does not choke intra-cluster replication and also has only a minor impact on XDCR latencies.
If you have some time, could you verify this on your setup too, since the issue is clearly present there?
thanks in advance
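For reference, a minimal sketch of applying that setting programmatically instead of through the UI. This assumes the /settings/replications REST endpoint and its xdcrMaxConcurrentReps parameter (the global XDCR settings API; verify against your server version) and uses the third-party requests library:

import requests

def set_xdcr_max_reps(host, user, password, max_reps):
    # Global XDCR setting; assumes /settings/replications accepts
    # xdcrMaxConcurrentReps (check the docs for your release).
    url = "http://%s:8091/settings/replications" % host
    resp = requests.post(url, auth=(user, password),
                         data={"xdcrMaxConcurrentReps": max_reps})
    resp.raise_for_status()
    return resp.text

# e.g. drop from the default 16 to 4 on a 4-core source cluster:
# set_xdcr_max_reps("10.198.21.69", "Administrator", "password", 4)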
Comment by Perry Krug [ 17/Aug/14 ]
Hi Sundar, it looks like that did help quite a bit. The intra-replication queue is way way down and the dcp stats seem to show that replication is actually getting a bit more time now:
[perry@ip-10-198-21-69 ~]$ /opt/couchbase/bin/cbstats localhost:11210 dcpthreads -b beer-sample
 replica_thread_1132874048: 529216
 replica_thread_1153853760: 519213
 replica_thread_1164343616: 498097
 replica_total: 1546526
 views_total: 0
 xdcr_thread_1132874048: 109795
 xdcr_thread_1143363904: 111192
 xdcr_thread_1153853760: 388307
 xdcr_thread_1164343616: 112447
 xdcr_total: 721741

Overall a definite improvement.

However, I would still point out that the intra-replication queue spikes upwards of 20k and is overall a bit higher with XDCR enabled than without. The CPU usage hasn't really changed, it's still around 75-80%. The spikes seem to be about every 30 seconds and correlate to a spike in the XDCR outbound mutations as well, which goes up to a few 100k and then slowly drains over the next 30 seconds before they both spike again.

I then tried setting it to 3 XDCR replicators and saw even better behavior in the intra-replication queue depth, it didn't exhibit any spikes over a few thousand.

This is definitely much much better, but I still think that we should be uncomfortable with the intra-replication traffic being so heavily impacted by the XDCR traffic.

Also, I think we should apply the same logic we use elsewhere to have the number of XDCR replicators default to 75% of cores instead of just always 16 (likely make the default max at 16 regardless of higher cores). Is that something we could get into 3.0 GA?

Thanks again for your continued work!

Perry
Comment by Pavel Paulau [ 18/Aug/14 ]
Separating my issue - MB-11984.
Comment by Sundar Sridharan [ 18/Aug/14 ]
Sure Perry. We already discussed setting the default value for the number of replicators to 75% of the number of cores; as you can see, that corresponds closely to the CPU utilization as well, since all the memcached worker threads are being used for either replication or XDCR.
Comment by Sundar Sridharan [ 18/Aug/14 ]
Alk, could you please help set the default number of XDCR replicators from 16 to 75% of the number of cores, to help intra-cluster replication? thanks
Comment by Aleksey Kondratenko [ 18/Aug/14 ]
No. We cannot go much below 16. 16 is not about concurrency on the source; it's about dealing with a limitation of the outbound architecture. We need many outgoing per-vbucket replications to limit our badness in dealing with the WAN between source and destination clusters.
Comment by Pavel Paulau [ 18/Aug/14 ]
I kind of disagree with this auto-tuning policy for XDCR replicators.
It doesn't take into account too many factors (topology, number of buckets, network latency, etc.).
Comment by Sundar Sridharan [ 18/Aug/14 ]
Thanks Alk, Pavel. Then we definitely need a release note documenting this workaround in case a customer hits this issue, which, from your comments, seems unlikely to begin with. Do you agree?
Comment by Raju Suravarjjala [ 18/Aug/14 ]
Triage: We will document this issue in the release notes for 3.0 and continue to look at it for 3.0.1.
Comment by Sundar Sridharan [ 25/Aug/14 ]
Assigning to Anil for proper release note documentation triaging. thanks




[MB-11640] DCP Prioritization Created: 03/Jul/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, DCP
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Perry Krug Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-11642 Intra-replication falling far behind ... Reopened

 Description   
It would be a valuable design improvement to allow for a high/low priority on DCP streams similar to how we handle bucket IO priority.

Intra-cluster DCP should always be prioritized over everything else, to protect against single-node failure and especially false-positive autofailover.

Others, such as XDCR and views, should initially be low priority, with a future improvement to allow end-user configuration if needed.
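A minimal sketch of the kind of two-level scheduling being requested (illustrative only, not the memcached/ep-engine design): streams registered as high priority, such as intra-cluster replication, are always served before low-priority streams such as XDCR and views.

from collections import deque

class PrioritizedStreams(object):
    """Toy scheduler: drain high-priority streams before low-priority ones."""
    def __init__(self):
        self.queues = {"high": deque(), "low": deque()}

    def add_stream(self, stream, priority="low"):
        self.queues[priority].append(stream)

    def next_stream(self):
        # Intra-cluster replication streams would register as "high" so they
        # are never starved by XDCR or view streams.
        for priority in ("high", "low"):
            if self.queues[priority]:
                stream = self.queues[priority].popleft()
                self.queues[priority].append(stream)  # round-robin within a level
                return stream
        return None

sched = PrioritizedStreams()
sched.add_stream("replica-vb-512", priority="high")
sched.add_stream("xdcr-vb-512", priority="low")
print(sched.next_stream())  # always the replica stream while one is pending

A real implementation would likely weight rather than strictly starve low-priority streams, but the sketch captures the requested ordering.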

 Comments   
Comment by Mike Wiederhold [ 10/Sep/14 ]
I discussed this issue with Trond since most of the work will need to be done in memcached. Resolving this issue will likely require some architectural changes and will need to be planned for a minor release.
Comment by Mike Wiederhold [ 10/Sep/14 ]
Assigning to Anil for planning since this is not a small change.




[MB-12179] Allow incremental pausable backfills Created: 12/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Currently ep-engine requires that backfills run from start to end and cannot be paused. This creates a problem for a few reasons. First, if a user has a large dataset then we will potentially need to backfill a large amount of data from disk into memory. Without the ability to pause and resume a backfill we cannot control the memory overhead created by reading items off disk. This can affect the resident ratio if the data that needs to be read by the backfill is large.

A second issue is that this means we can only run one backfill at a time (or two if there are enough CPU cores), and all backfills must be run serially. In the future we plan to allow more DCP connections to be created to a server. If many connections require backfill, some connections may not receive data for an extended period of time because they are waiting for their backfills to be scheduled.
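A minimal sketch of what an incremental, pausable backfill could look like, using a Python generator purely for illustration (the real work would live in ep-engine's task scheduler; read_range and the (seqno, key, value) item shape are assumptions):

def pausable_backfill(read_range, start_seqno, end_seqno, batch_size=1000):
    """Generator form of a backfill: produces one batch at a time so the
    scheduler can pause it (stop pulling batches) and resume it later."""
    seqno = start_seqno
    while seqno <= end_seqno:
        batch = read_range(seqno, min(seqno + batch_size - 1, end_seqno))
        if not batch:
            return
        yield batch
        seqno = batch[-1][0] + 1   # assume items are (seqno, key, value)

# Usage: the consumer drives the pace, so memory overhead is bounded by one
# batch, and several backfills can be interleaved instead of run serially.
def fake_read_range(lo, hi):
    return [(s, "key-%d" % s, "val") for s in range(lo, hi + 1)]

for batch in pausable_backfill(fake_read_range, 1, 5000):
    pass  # stream each batch to the DCP connection, check memory, etc.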




[MB-11143] Avg. BgFetcher wait time is still 3-4 times higher on a single HDD in 3.0 Created: 16/May/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = HDD

Attachments: PNG File bg_wait_time.png    
Issue Links:
Dependency
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: 2.5.1:
http://ci.sc.couchbase.com/job/thor-64/637/artifact/

3.0.0-680:
http://ci.sc.couchbase.com/job/thor-64/643/artifact/
Is this a Regression?: Yes

 Description   
Read-heavy KV workload:
-- 4 nodes
-- 1 bucket x 200M x 1KB (20-30% resident)
-- 2K ops/sec
-- 1 / 80 / 18 / 1 C/R/U/D
-- 15-20% cache miss rate (most will miss page cache as well)

Average latency is still 5-6 times higher in 3.0. Histograms:

https://gist.github.com/pavel-paulau/e9b8ab4d75b9a662ff07

 Comments   
Comment by Pavel Paulau [ 16/May/14 ]
I know you made several fixes.

But at least this workload still looks bad.

In the meantime, I will test SSDs (both cheap and expensive).
Comment by Chiyoung Seo [ 16/May/14 ]
Thanks Pavel for identifying this issue.

Sundar and I recently found that there are still some issues in scheduling the global threads.

Sundar,

Please take a look at this issue too. Thanks!
Comment by Sundar Sridharan [ 16/May/14 ]
Hi Pavel,
did you see a recent regression within 3.0 or is it from 2.5 only?
thanks
Comment by Pavel Paulau [ 16/May/14 ]
From 2.5 only.
Comment by Chiyoung Seo [ 16/May/14 ]
Sundar,

We saw this regression in 3.0 compared with 2.5
Comment by Sundar Sridharan [ 16/May/14 ]
Just to help narrow down the cause, is the increased latency seen in 3.0 with just 2 shards as opposed to 4? thanks
Comment by Chiyoung Seo [ 16/May/14 ]
The machine has 24 cores. I think Pavel used the default settings of our global thread pool.
Comment by Pavel Paulau [ 17/May/14 ]
On cheap SSD drives it's only ~50% slower.
Comment by Pavel Paulau [ 19/May/14 ]
On fast drives (RAID 10 SSD) it looks a little bit better.

It means that the issue applies only to cheap, saturated devices. Not unexpected.
Comment by Chiyoung Seo [ 19/May/14 ]
Pavel,

Now I remember that we had some performance regression in bg fetch requests when the disk is a slow HDD (e.g., a single volume) or a commodity SSD drive. Can you test it with more advanced HDD settings like RAID 10?

The PM team mentioned that more and more of our customers use advanced HDD setups, enterprise SSDs, or Amazon EC2 SSD instances.
Comment by Pavel Paulau [ 19/May/14 ]
It's 4x slower with 2 shards (Sundar requested this benchmark).

Let's discuss the problem once we get setups with RAID 10 HDD (CBIT-1158).
Comment by Pavel Paulau [ 21/Jun/14 ]
I was finally able to run the same tests on SSD and RAID 10 HDD. Everything looks good; there is no regression.

We also replaced the slow single HDD disks with faster ones. On average, 3.0.0 builds look 3-4 times slower. A comparison of histograms is below:

2.5.1-1083 (logs - http://ci.sc.couchbase.com/job/leto/125/artifact/):

 bg_wait (338200 total)
    4us - 8us : ( 0.01%) 46
    8us - 16us : ( 0.35%) 1128
    16us - 32us : ( 7.14%) 22988 ##
    32us - 64us : ( 18.33%) 37816 ####
    64us - 128us : ( 31.60%) 44909 #####
    128us - 256us : ( 32.25%) 2193
    256us - 512us : ( 32.49%) 805
    512us - 1ms : ( 32.79%) 1016
    1ms - 2ms : ( 33.95%) 3916
    2ms - 4ms : ( 37.52%) 12089 #
    4ms - 8ms : ( 46.04%) 28795 ###
    8ms - 16ms : ( 58.17%) 41044 #####
    16ms - 32ms : ( 71.92%) 46474 #####
    32ms - 65ms : ( 84.73%) 43333 #####
    65ms - 131ms : ( 95.06%) 34931 ####
    131ms - 262ms : ( 99.52%) 15109 #
    262ms - 524ms : ( 99.99%) 1584
    524ms - 1s : (100.00%) 24

3.0.0-849 (logs - http://ci.sc.couchbase.com/job/leto/129/artifact/):

 bg_wait (339115 total)
    4us - 8us : ( 0.00%) 6
    8us - 16us : ( 0.03%) 101
    16us - 32us : ( 3.36%) 11291 #
    32us - 64us : ( 20.82%) 59206 #######
    64us - 128us : ( 39.39%) 62969 #######
    128us - 256us : ( 40.56%) 3984
    256us - 512us : ( 40.76%) 681
    512us - 1ms : ( 40.98%) 722
    1ms - 2ms : ( 41.89%) 3087
    2ms - 4ms : ( 43.73%) 6236
    4ms - 8ms : ( 47.58%) 13064 #
    8ms - 16ms : ( 53.79%) 21074 ##
    16ms - 32ms : ( 63.65%) 33410 ####
    32ms - 65ms : ( 75.12%) 38904 ####
    65ms - 131ms : ( 85.72%) 35968 ####
    131ms - 262ms : ( 93.55%) 26550 ###
    262ms - 524ms : ( 97.93%) 14844 #
    524ms - 1s : ( 99.38%) 4904
    1s - 2s : ( 99.81%) 1486
    2s - 4s : ( 99.96%) 509
    4s - 8s : (100.00%) 119
Comment by Pavel Paulau [ 21/Jun/14 ]
It's too late to improve these characteristics in 3.0, but I strongly recommend keeping this open and considering possible optimizations later.
Comment by Sundar Sridharan [ 23/Jun/14 ]
It may not be too late yet; bg fetch latencies are very important, especially with full eviction. The root cause could be either the increased number of writer threads or just a scheduling issue. If it is the former, the fix may be nontrivial as you suggest; however, if it is the latter, then it should be easy to fix. I hope to have a fix for this soon.
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post 3.0 because we observed lower latency on SSD and RAID HDD environments, but only saw the performance regressions on a single HDD.
Comment by Cihan Biyikoglu [ 25/Jun/14 ]
Agreed. single HDD is a dev scenario and not perf critical.
Comment by Sundar Sridharan [ 31/Jul/14 ]
Pavel,
fix: http://review.couchbase.org/#/c/40080/ and
fix: http://review.couchbase.org/#/c/40084/
are expected to improve bgfetch latencies significantly.
Could you please investigate this with heavy-DGM scenarios?
thanks
Comment by Pavel Paulau [ 04/Aug/14 ]
Read performance has regressed since our last conversation.

Recent changes improved the situation, but the numbers are still higher than in 2.5.1.

Stats from build 3.0.0-1097 with compaction_number_of_kv_workers=1:

bg_wait (338947 total)
    4us - 8us : ( 0.00%) 11
    8us - 16us : ( 0.05%) 151
    16us - 32us : ( 2.86%) 9518 #
    32us - 64us : ( 20.28%) 59054 #######
    64us - 128us : ( 33.59%) 45124 #####
    128us - 256us : ( 34.64%) 3564
    256us - 512us : ( 34.89%) 821
    512us - 1ms : ( 35.09%) 685
    1ms - 2ms : ( 35.97%) 2976
    2ms - 4ms : ( 37.66%) 5740
    4ms - 8ms : ( 41.07%) 11573 #
    8ms - 16ms : ( 46.15%) 17197 ##
    16ms - 32ms : ( 55.17%) 30585 ###
    32ms - 65ms : ( 66.84%) 39569 ####
    65ms - 131ms : ( 79.62%) 43292 #####
    131ms - 262ms : ( 90.31%) 36231 ####
    262ms - 524ms : ( 96.68%) 21602 ##
    524ms - 1s : ( 98.93%) 7620
    1s - 2s : ( 99.71%) 2648
    2s - 4s : ( 99.95%) 803
    4s - 8s : ( 99.99%) 159
    8s - 16s : (100.00%) 24

All logs: http://ci.sc.couchbase.com/job/leto-dev/1/artifact/




[MB-12159] Memcached throws an irrelevant message while trying to update a locked key Created: 09/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Aruna Piravi Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1208

Triage: Untriaged
Is this a Regression?: No

 Description   
A simple test to see if updates are possible on locked keys:

def test_lock(self):
        src = MemcachedClient(host=self.src_master.ip, port=11210)
        # first set
        src.set('pymc1098', 0, 0, "old_doc")
        # apply lock
        src.getl('pymc1098', 30, 0)
        # update key
        src.set('pymc1098', 0, 0, "new_doc")

throws the following Memcached error -

  File "pytests/xdcr/uniXDCR.py", line 784, in test_lock
    src.set('pymc1098', 0, 0, "new_doc")
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 163, in set
    return self._mutate(memcacheConstants.CMD_SET, key, exp, flags, 0, val)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 132, in _mutate
    cas)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 128, in _doCmd
    return self._handleSingleResponse(opaque)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 121, in _handleSingleResponse
    cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 117, in _handleKeyedResponse
    raise MemcachedError(errcode, rv)
MemcachedError: Memcached error #2 'Exists': Data exists for key for vbucket :0 to mc 10.3.4.186:11210
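For context, a plain set on a locked key is rejected until the lock expires or the mutation supplies the CAS returned by getl; the complaint here is that the error surfaced is the generic 'Exists' rather than something that mentions locking. A minimal sketch of the CAS-based update path, assuming mc_bin_client's getl returns a (flags, cas, value) tuple and that a cas() helper with the signature cas(key, exp, flags, old_cas, value) exists (names taken from testrunner's lib/mc_bin_client.py; verify against your copy):

from mc_bin_client import MemcachedClient

src = MemcachedClient(host="10.3.4.186", port=11210)
src.set('pymc1098', 0, 0, "old_doc")

# getl locks the key for 30s and (assumed) returns its current CAS.
flags, locked_cas, value = src.getl('pymc1098', 30, 0)

# A plain set is rejected while the lock is held (the 'Exists' error above);
# supplying the CAS obtained from getl is what allows the update and
# releases the lock.
src.cas('pymc1098', 0, 0, locked_cas, "new_doc")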






[MB-12169] Unexpected disk creates during graceful failover Created: 10/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Perry Krug Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
4-node cluster with the beer-sample bucket plus 300k items. The workload is 50/50 gets/sets, but the sets constantly update the same 300k items.

When I do a graceful failover of one node, I see a fair amount of disk creates even though no new data is being inserted.

If there is a reasonable explanation, great, but I am concerned that something incorrect may be going on with either the identification of new data or the movement of vBuckets.

Logs are here:
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-193-230-57.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-215-23-198.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-215-29-139.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-215-40-174.us-west-1.compute.amazonaws.com.zip




[MB-12104] Carrier Config missing after R/W Concurrency change Created: 01/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Michael Nitschinger Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks JCBC-537 Setting Reader/Writer Worker value on... Open
Triage: Untriaged
Is this a Regression?: No

 Description   
Hi,

while investigating JCBC-537 I found what I think is a pretty severe issue when changing the R/W concurrency.

When it is changed in the UI, the GET_CFG command from CCCP returns success, but with an empty response. Once the server(s) are restarted, the config is back.

This manifests in JCBC-537 when tested with 2.5.1 (and the thread counts are changed), but it persists even with 3.0 when the setting is just set to high. The same issue occurs on both single-node and multi-node clusters. I tried restarting the single node (3.0) when it was set to high, and after it came back up the command worked.

I guess some processes need to be restarted for the change to take effect, and they are not picking up the binary config afterwards?

I set this to blocker because it is already harming production boxes; feel free to lower it. As far as I can see, the workaround is to restart the cluster after the setting is changed.

 Comments   
Comment by Aleksey Kondratenko [ 02/Sep/14 ]
It may actually be ns_server bug. Easiest way to test is to try killing ns_server's beam.smp and see if it heals the problem. I think it will.
Comment by Sundar Sridharan [ 02/Sep/14 ]
Thanks Alk, I was just in the process of updating the bug too - the cluster config is set by PROTOCOL_BINARY_CMD_SET_CLUSTER_CONFIG, which needs to be called by ns_server if the bucket is restarted due to a change like the read/write concurrency setting in the UI.
ep-engine only caches this map and returns it when clients connect; it cannot persist it across restarts.
Comment by Aleksey Kondratenko [ 02/Sep/14 ]
Lowered severity to major given that:

* there's easy workaround

* bucket settings are not expected to be changed frequently

Plan to fix it as part of 3.0.1.




[MB-11627] Expose DCP as a public protocol for change tracking Created: 02/Jul/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Cihan Biyikoglu Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 1
Labels: upr
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Enable a publicly supported API for DCP (UPR) that allows clients to open a DCP stream and detect changes for a given bucket, with a given filter on keys or values.
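A sketch of what a client-facing change feed could look like from the application side. The API names here are entirely hypothetical; no public DCP client interface exists today, which is exactly what this request asks for:

# Hypothetical interface only -- illustrating the requested capability,
# not an existing Couchbase client API.
class ChangeStream(object):
    def __init__(self, bucket, key_prefix=None, op_types=("mutation", "deletion")):
        self.bucket = bucket
        self.key_prefix = key_prefix
        self.op_types = op_types

    def events(self):
        """Would open a DCP stream per vbucket and yield matching changes."""
        raise NotImplementedError("placeholder for a future public DCP client")

# Desired usage: follow changes to keys with a given prefix, e.g. to
# invalidate an external cache tier (see the comment below on filtering
# by operation type):
# for change in ChangeStream("default", key_prefix="user::").events():
#     invalidate_cache(change.key)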

 Comments   
Comment by Matt Ingenthron [ 29/Aug/14 ]
Filtering on operation type too, say for cache invalidation at other tiers.




[MB-12211] Investigate noop not closing connection in case where a dead connection is still attached to a failed node Created: 18/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
See MB-12158 for information on how to reproduce this issue and why it needs to be looked at on the ep-engine side.




[MB-12117] Access log generation holds onto locks causing a performance slow down. Created: 03/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Critical
Reporter: Patrick Varley Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency

 Description   
Copied DaveR comment from CBSE-1349:

"""
I was looking at the code for the access log scanner, and it looks like we actually perform disk I/O (to write the access.log) *while holding the internal HashTable bucket lock*. This means that any operations which need to mutate documents in HashTable buckets covered by that lock will be blocked (!)

From the code:

  http://src.couchbase.org/source/xref/2.5.1/ep-engine/src/access_scanner.cc#31 - Visit each element, creating new log item...
  http://src.couchbase.org/source/xref/2.5.1/ep-engine/src/mutation_log.cc#91 - Write entry into an (in-memory) buffer...
  http://src.couchbase.org/source/xref/2.5.1/ep-engine/src/mutation_log.cc#421 - Add to buffer; once buffer is full then flush buffer to disk.

We probably want to separate log building and file I/O, to minimise the length of time the HashTable bucket lock is held (and hence minimise customer impact).
"""

 Comments   
Comment by Cihan Biyikoglu [ 03/Sep/14 ]
Is there any way to quantify how much performance improvement this will bring, and under what operations?
Comment by Patrick Varley [ 03/Sep/14 ]
It looks to be a 20% drop in write operations; if you are using sync operations, that will have a knock-on effect on get operations. See the graph in CBSE-1349.




[MB-12145] {DCP}:: After Rebalance ep_queue_size Stat gives incorrect info about persistence Created: 08/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 1208, 10.6.2.145-10.6.2.150

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.145-982014-1126-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.145-982014-1143-couch.tar.gz
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.146-982014-1129-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.146-982014-1143-couch.tar.gz
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.147-982014-1132-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.147-982014-1143-couch.tar.gz
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.148-982014-1135-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.148-982014-1143-couch.tar.gz
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.149-982014-1138-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.149-982014-1144-couch.tar.gz
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.150-982014-1141-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12145/10.6.2.150-982014-1144-couch.tar.gz
Is this a Regression?: Yes

 Description   

1. Create a 6-node cluster
2. Create a default bucket with 10K items
3. After ep_queue_size = 0, take a snapshot of all data using cbtransfer (for couchstore files)
4. Rebalance out 1 node
5. After ep_queue_size = 0, sleep for 30 seconds, then take a snapshot of all data using cbtransfer (for couchstore files)

Comparing Step 5 with Step 3 shows an inconsistency in the expected keys, as we find some keys missing. We also do data verification using another client, which does not fail. Also, active and replica item counts are as expected. The issue is seen in the expected items in the couchstore files:

mike1651, mike6340, mike8616, mike5380, mike2691, mike4740, mike6432, mike9418, mike9769, mike244, mike7561, mike5613, mike6743, mike2073, mike1252, mike4431, mike9346, mike4343, mike9037, mike6866, mike2302, mike3652, mike7889, mike2998

Note that on increasing the delay after we see ep_queue_size = 0 from 30 to 60 to 120 seconds, we still hit the issue of some keys missing. After adjusting the delay to 240 seconds, we did not see the missing keys.

This is not a case of data loss. Only the stat (ep_queue_size = 0) is incorrect. I have verified cbtransfer functionality and it does not break during the test runs.

Test Case: ./testrunner -i ~/run_tests/palm.ini -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_after_ops,nodes_out=1,replicas=1,items=10000,skip_cleanup=True

Also, with vbuckets=128 this problem does not reproduce, so please try it with 1024 vbuckets.

We have seen this issue in different places for failover+rebalance.
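For context, the verification in steps 3 and 5 gates on the persistence queue stat; a simplified sketch of that wait, assuming mc_bin_client's stats() call (as used elsewhere in testrunner) and a default bucket with no SASL auth:

import time
from mc_bin_client import MemcachedClient

def wait_for_persistence(host, timeout=240):
    """Poll ep_queue_size until it reports 0 (which this bug shows can
    happen before all items are actually on disk)."""
    client = MemcachedClient(host=host, port=11210)
    end = time.time() + timeout
    while time.time() < end:
        stats = client.stats()
        if int(stats["ep_queue_size"]) == 0:
            return True
        time.sleep(2)
    return False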



 Comments   
Comment by Ketaki Gangal [ 12/Sep/14 ]
Ran into the same issue with ./testrunner -i /tmp/rebal.ini active_resident_threshold=100,dgm_run=true,get-delays=True,get-cbcollect-info=True,eviction_policy=fullEviction,max_verify=100000 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_after_ops,nodes_out=1,replicas=1,items=10000,GROUP=OUT

It uses the same verification method as above and fails due to the ep_queue_size stat:
1. Create cluster
3. After ep_queue_size =0, take snap-shot of all data using cbtransfer (for couchstore files)
4. Rebalance-out 1 Node
5. After ep_queue_size =0, sleep for 30 seconds, take snap-shot of all data using cbtransfer (for couchstore files)




[MB-12160] setWithMeta() is able to update a locked remote key Created: 09/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: all, 3.0.0-1208

Attachments: Zip Archive 10.3.4.186-992014-168-diag.zip     Zip Archive 10.3.4.188-992014-1611-diag.zip    
Triage: Untriaged
Is this a Regression?: No

 Description   
A simple test to check whether setWithMeta() refrains from updating a locked key:

Steps
--------
1. uni-xdcr on default bucket from .186 --> .188
2. create a key 'pymc1098' with value "old_doc" on .186
3. sleep for 10 secs; it gets replicated to .188
4. Now getAndLock() on 'pymc1098' on .188 for 20s
5. Meanwhile, update the same key at .186
6. After 10s (the lock should not have expired yet; also see the timestamps in the test log below), do a getMeta() at source and dest: they match.
The destination key contains "new_doc".


def test_replication_after_getAndLock_dest(self):
        src = MemcachedClient(host=self.src_master.ip, port=11210)
        dest = MemcachedClient(host=self.dest_master.ip, port=11210)
        self.log.info("Initial set = key:pymc1098, value=\"old_doc\" ")
        src.set('pymc1098', 0, 0, "old_doc")
        # wait for doc to replicate
        self.sleep(10)
        # apply lock on destination
        self.log.info("getAndLock at destination for 20s ...")
        dest.getl('pymc1098', 20, 0)
        # update source doc
        self.log.info("Updating 'pymc1098' @ source with value \"new_doc\"...")
        src.set('pymc1098', 0, 0, "new_doc")
        self.sleep(10)
        self.log.info("getMeta @ src: {}".format(src.getMeta('pymc1098')))
        self.log.info("getMeta @ dest: {}".format(dest.getMeta('pymc1098')))
        src_doc = src.get('pymc1098')
        dest_doc = dest.get('pymc1098')


2014-09-09 15:27:13 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] Initial set = key:pymc1098, value="old_doc"
2014-09-09 15:27:13 | INFO | MainProcess | test_thread | [xdcrbasetests.sleep] sleep for 10 secs for doc to be replicated ...
2014-09-09 15:27:23 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] getAndLock at destination for 20s ...
2014-09-09 15:27:23 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] Updating 'pymc1098' @ source with value "new_doc"...
2014-09-09 15:27:23 | INFO | MainProcess | test_thread | [xdcrbasetests.sleep] sleep for 10 secs. ...
2014-09-09 15:27:33 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] getMeta @ src: (0, 0, 0, 2, 16849348715855509)
2014-09-09 15:27:33 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] getMeta @ dest: (0, 0, 0, 2, 16849348715855509)
2014-09-09 15:27:33 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] src_doc = (0, 16849348715855509, 'new_doc')
dest_doc =(0, 16849348715855509, 'new_doc')

Will attach cbcollect.
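For reference, the outcome the test expects could be captured with assertions like the following (hypothetical, not part of the original test; the tuples are the (flags, cas, value) triples returned by get(), as shown in the log above):

# While the 20s lock is held on the destination, the replicated mutation
# should be deferred, so the destination should still hold the old value.
assert src_doc[2] == "new_doc"   # source was updated
assert dest_doc[2] == "old_doc"  # expected, but the test observes "new_doc"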

 Comments   
Comment by Aruna Piravi [ 09/Sep/14 ]
This causes an inconsistency: the server itself disallows a plain set on a locked key, but allows a set through setWithMeta.




[MB-11373] "Error: internal (memcached_error)" seen when raw document is sent with compressed datatype. Created: 10/Jun/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
Steps to reproduce:
1) Using a 3.x client, send a raw binary document with the datatype set to "raw compressed" to the server.
2) From the UI, go to the Documents page /index.html#sec=documents&viewsBucket=default&documentsPageNumber=0
3) The error "Error: internal (memcached_error)" is seen.
4) From a 2.x client, issue a GET for this document; the following error is seen:
2014-06-10 11:14:56.785 INFO com.couchbase.client.CouchbaseConnection: Reconnection due to exception handling a memcached operation on {QA sa=, #Rops=1, #Wops=0, #iq=0, topRop=Cmd: 0 Opaque: 1 Key: fooaaa, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:166)
5) This is due to the server trying to uncompress a raw document.

 Comments   
Comment by Abhinav Dangeti [ 17/Jun/14 ]
Since this was before datatype support was disabled in 3.0, just confirming: the client did do the HELLO exchange in your scenario, right?
Comment by Abhinav Dangeti [ 17/Jun/14 ]
Point (5) is correct: uncompression of an uncompressed document is failing.
On the Documents page, the internal memcached_error that shows up is very likely a result of the absence of the HELLO exchange again (this time by the view client), as it tries to uncompress a non-compressed document.
Comment by Abhinav Dangeti [ 18/Jun/14 ]
On another note, if the client isn't datatype compliant, memcached will try to uncompress documents through snappy. So how about sending the document as-is if snappy_uncompress fails?
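A rough sketch of that fallback, written here with the python-snappy bindings purely for illustration (the real change would live in memcached's datatype handling, not in Python):

import snappy

def value_for_legacy_client(stored_value):
    # A client that did not do HELLO cannot handle the compressed datatype,
    # so memcached tries to uncompress. If the payload is not actually
    # snappy-compressed, fall back to returning it unchanged instead of
    # surfacing an internal error.
    try:
        return snappy.uncompress(stored_value)
    except Exception:
        return stored_value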
Comment by Trond Norbye [ 18/Jun/14 ]
I may be reading this wrong, but it sounds to me like you're sending just raw data while telling the system that it is compressed... I wouldn't expect that to work.
Comment by Abhinav Dangeti [ 18/Jun/14 ]
I guess that is the scenario Venu is attempting with a HELLO-compliant client, though (as a negative test).
Comment by Trond Norbye [ 18/Jun/14 ]
For the test to be "valid" he would have to have a new, compatible client and:

1) send HELLO and have the server accept the use of datatype
2) compress the data with snappy and store it as compressed with the datatype bit set
3) connect with another client, without doing HELLO
4) request the data

It's like copying a PDF version of /etc/passwd in as /etc/passwd on your server and expecting it to work.
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post 3.0 as the datatype support is not in 3.0
Comment by Abhinav Dangeti [ 08/Sep/14 ]
http://review.couchbase.org/#/c/41263/




[MB-10788] Enhance the OBSERVE durability support by leveraging VBucket UUID and seq number Created: 07/Apr/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: feature-backlog
Fix Version/s: sherlock
Security Level: Public

Type: Bug Priority: Critical
Reporter: Chiyoung Seo Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The current OBSERVE command is based on the CAS value returned from the server, which consequently can't provide correct tracking, especially in failover scenarios. To enhance it, we will investigate leveraging the VBucket UUID and sequence number to provide better tracking for various failover and soft/hard node shutdown scenarios.
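As a rough illustration of the direction, a polling loop over a hypothetical observe_seqno(vbucket) call returning the vbucket UUID and the persisted sequence number might look like the sketch below (all names here are assumptions, not the actual design):

import time

def wait_for_persistence(client, vbucket, target_seqno, expected_uuid, timeout=5.0):
    # Poll until the mutation's sequence number is persisted, or bail out
    # if the vbucket UUID changes, which would indicate a failover and
    # force the caller to re-check durability.
    deadline = time.time() + timeout
    while time.time() < deadline:
        uuid, persisted_seqno = client.observe_seqno(vbucket)
        if uuid != expected_uuid:
            raise RuntimeError("vbucket failed over; durability unknown")
        if persisted_seqno >= target_seqno:
            return True
        time.sleep(0.05)
    return False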

 Comments   
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
There are a number of reasons why this comes up regularly with customers:
- they see replication as a better way to achieve durability than local node disk persistence.
- it would allow replica reads without compromising consistency.
Comment by Chiyoung Seo [ 02/Sep/14 ]
We discussed a high-level design that is still based on a polling approach like OBSERVE, but provides better performance (e.g., latency) and replication tracking. Sriram will write up the design doc and share it later.




[MB-12192] XDCR : After warmup, replica items are not deleted in destination cluster Created: 15/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, DCP
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.x, 3.0.1-1297-rel

Attachments: Zip Archive 172.23.106.45-9152014-1553-diag.zip     GZip Archive 172.23.106.45-9152014-1623-couch.tar.gz     Zip Archive 172.23.106.46-9152014-1555-diag.zip     GZip Archive 172.23.106.46-9152014-1624-couch.tar.gz     Zip Archive 172.23.106.47-9152014-1558-diag.zip     GZip Archive 172.23.106.47-9152014-1624-couch.tar.gz     Zip Archive 172.23.106.48-9152014-160-diag.zip     GZip Archive 172.23.106.48-9152014-1624-couch.tar.gz    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Steps
--------
1. Set up uni-directional XDCR between 2 clusters with at least 2 nodes each
2. Load 5000 items into 3 buckets at the source; they get replicated to the destination
3. Reboot a non-master node on the destination (in this test .48)
4. After warmup, perform 30% updates and 30% deletes on the source cluster
5. Deletes get propagated to the active vbuckets on the destination, but replica vbuckets only see partial deletion.

Important note
--------------------
This test had passed on 3.0.0-1208-rel and 3.0.0-1209-rel. However, I'm able to reproduce this consistently on 3.0.1. Unsure whether this is a recent regression.

2014-09-15 14:43:50 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', sasl_bucket_1 bucket
2014-09-15 14:43:51 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', standard_bucket_1 bucket
2014-09-15 14:43:51 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

Testcase
------------
./testrunner -i /tmp/bixdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,reboot=dest_node,items=5000,rdirection=unidirection,replication_type=xmem,standard_buckets=1,sasl_buckets=1,pause=source,doc-ops=update-delete,doc-ops-dest=update-delete

On destination cluster
-----------------------------

Arunas-MacBook-Pro:bin apiravi$ ./cbvdiff 172.23.106.47:11210,172.23.106.48:11210
VBucket 512: active count 4 != 6 replica count

VBucket 513: active count 2 != 4 replica count

VBucket 514: active count 8 != 11 replica count

VBucket 515: active count 3 != 4 replica count

VBucket 516: active count 8 != 10 replica count

VBucket 517: active count 5 != 6 replica count

VBucket 521: active count 0 != 1 replica count

VBucket 522: active count 7 != 11 replica count

VBucket 523: active count 3 != 5 replica count

VBucket 524: active count 6 != 10 replica count

VBucket 525: active count 4 != 6 replica count

VBucket 526: active count 4 != 6 replica count

VBucket 528: active count 7 != 10 replica count

VBucket 529: active count 3 != 4 replica count

VBucket 530: active count 3 != 4 replica count

VBucket 532: active count 0 != 2 replica count

VBucket 533: active count 1 != 2 replica count

VBucket 534: active count 8 != 10 replica count

VBucket 535: active count 5 != 6 replica count

VBucket 536: active count 7 != 11 replica count

VBucket 537: active count 3 != 5 replica count

VBucket 540: active count 3 != 4 replica count

VBucket 542: active count 6 != 10 replica count

VBucket 543: active count 4 != 6 replica count

VBucket 544: active count 6 != 10 replica count

VBucket 545: active count 3 != 4 replica count

VBucket 547: active count 0 != 1 replica count

VBucket 548: active count 6 != 7 replica count

VBucket 550: active count 7 != 10 replica count

VBucket 551: active count 4 != 5 replica count

VBucket 552: active count 9 != 11 replica count

VBucket 553: active count 4 != 6 replica count

VBucket 554: active count 4 != 5 replica count

VBucket 555: active count 1 != 2 replica count

VBucket 558: active count 7 != 10 replica count

VBucket 559: active count 3 != 4 replica count

VBucket 562: active count 6 != 10 replica count

VBucket 563: active count 4 != 5 replica count

VBucket 564: active count 7 != 10 replica count

VBucket 565: active count 4 != 5 replica count

VBucket 566: active count 4 != 5 replica count

VBucket 568: active count 3 != 4 replica count

VBucket 570: active count 8 != 10 replica count

VBucket 571: active count 4 != 6 replica count

VBucket 572: active count 7 != 10 replica count

VBucket 573: active count 3 != 4 replica count

VBucket 574: active count 0 != 1 replica count

VBucket 575: active count 0 != 1 replica count

VBucket 578: active count 8 != 10 replica count

VBucket 579: active count 4 != 6 replica count

VBucket 580: active count 8 != 11 replica count

VBucket 581: active count 3 != 4 replica count

VBucket 582: active count 3 != 4 replica count

VBucket 583: active count 1 != 2 replica count

VBucket 584: active count 3 != 4 replica count

VBucket 586: active count 6 != 10 replica count

VBucket 587: active count 3 != 4 replica count

VBucket 588: active count 7 != 10 replica count

VBucket 589: active count 4 != 5 replica count

VBucket 591: active count 0 != 2 replica count

VBucket 592: active count 8 != 10 replica count

VBucket 593: active count 4 != 6 replica count

VBucket 594: active count 0 != 1 replica count

VBucket 595: active count 0 != 1 replica count

VBucket 596: active count 4 != 6 replica count

VBucket 598: active count 7 != 10 replica count

VBucket 599: active count 3 != 4 replica count

VBucket 600: active count 6 != 10 replica count

VBucket 601: active count 3 != 4 replica count

VBucket 602: active count 4 != 6 replica count

VBucket 606: active count 7 != 10 replica count

VBucket 607: active count 4 != 5 replica count

VBucket 608: active count 7 != 11 replica count

VBucket 609: active count 3 != 5 replica count

VBucket 610: active count 3 != 4 replica count

VBucket 613: active count 0 != 1 replica count

VBucket 614: active count 6 != 10 replica count

VBucket 615: active count 4 != 6 replica count

VBucket 616: active count 7 != 10 replica count

VBucket 617: active count 3 != 4 replica count

VBucket 620: active count 3 != 4 replica count

VBucket 621: active count 1 != 2 replica count

VBucket 622: active count 9 != 11 replica count

VBucket 623: active count 5 != 6 replica count

VBucket 624: active count 5 != 6 replica count

VBucket 626: active count 7 != 11 replica count

VBucket 627: active count 3 != 5 replica count

VBucket 628: active count 6 != 10 replica count

VBucket 629: active count 4 != 6 replica count

VBucket 632: active count 0 != 1 replica count

VBucket 633: active count 0 != 1 replica count

VBucket 634: active count 7 != 10 replica count

VBucket 635: active count 3 != 4 replica count

VBucket 636: active count 8 != 10 replica count

VBucket 637: active count 5 != 6 replica count

VBucket 638: active count 5 != 6 replica count

VBucket 640: active count 2 != 4 replica count

VBucket 641: active count 7 != 11 replica count

VBucket 643: active count 5 != 7 replica count

VBucket 646: active count 3 != 5 replica count

VBucket 647: active count 7 != 10 replica count

VBucket 648: active count 4 != 6 replica count

VBucket 649: active count 8 != 10 replica count

VBucket 651: active count 0 != 1 replica count

VBucket 653: active count 4 != 6 replica count

VBucket 654: active count 3 != 4 replica count

VBucket 655: active count 7 != 10 replica count

VBucket 657: active count 4 != 5 replica count

VBucket 658: active count 2 != 4 replica count

VBucket 659: active count 7 != 11 replica count

VBucket 660: active count 3 != 5 replica count

VBucket 661: active count 7 != 10 replica count

VBucket 662: active count 0 != 2 replica count

VBucket 666: active count 4 != 6 replica count

VBucket 667: active count 8 != 10 replica count

VBucket 668: active count 3 != 4 replica count

VBucket 669: active count 7 != 10 replica count

VBucket 670: active count 1 != 2 replica count

VBucket 671: active count 2 != 3 replica count

VBucket 673: active count 0 != 1 replica count

VBucket 674: active count 3 != 4 replica count

VBucket 675: active count 7 != 10 replica count

VBucket 676: active count 5 != 6 replica count

VBucket 677: active count 8 != 10 replica count

VBucket 679: active count 5 != 6 replica count

VBucket 681: active count 6 != 7 replica count

VBucket 682: active count 3 != 5 replica count

VBucket 683: active count 8 != 12 replica count

VBucket 684: active count 3 != 6 replica count

VBucket 685: active count 7 != 11 replica count

VBucket 688: active count 3 != 4 replica count

VBucket 689: active count 7 != 10 replica count

VBucket 692: active count 1 != 2 replica count

VBucket 693: active count 2 != 3 replica count

VBucket 694: active count 5 != 6 replica count

VBucket 695: active count 8 != 10 replica count

VBucket 696: active count 3 != 5 replica count

VBucket 697: active count 8 != 12 replica count

VBucket 699: active count 4 != 5 replica count

VBucket 700: active count 0 != 1 replica count

VBucket 702: active count 3 != 6 replica count

VBucket 703: active count 7 != 11 replica count

VBucket 704: active count 3 != 5 replica count

VBucket 705: active count 8 != 12 replica count

VBucket 709: active count 4 != 5 replica count

VBucket 710: active count 3 != 6 replica count

VBucket 711: active count 7 != 11 replica count

VBucket 712: active count 3 != 4 replica count

VBucket 713: active count 7 != 10 replica count

VBucket 715: active count 3 != 4 replica count

VBucket 716: active count 1 != 2 replica count

VBucket 717: active count 0 != 2 replica count

VBucket 718: active count 5 != 6 replica count

VBucket 719: active count 8 != 10 replica count

VBucket 720: active count 0 != 1 replica count

VBucket 722: active count 3 != 5 replica count

VBucket 723: active count 8 != 12 replica count

VBucket 724: active count 3 != 6 replica count

VBucket 725: active count 7 != 11 replica count

VBucket 727: active count 5 != 7 replica count

VBucket 728: active count 2 != 4 replica count

VBucket 729: active count 3 != 5 replica count

VBucket 730: active count 3 != 4 replica count

VBucket 731: active count 7 != 10 replica count

VBucket 732: active count 5 != 6 replica count

VBucket 733: active count 8 != 10 replica count

VBucket 737: active count 3 != 4 replica count

VBucket 738: active count 4 != 6 replica count

VBucket 739: active count 8 != 10 replica count

VBucket 740: active count 3 != 4 replica count

VBucket 741: active count 7 != 10 replica count

VBucket 743: active count 0 != 1 replica count

VBucket 746: active count 2 != 4 replica count

VBucket 747: active count 7 != 11 replica count

VBucket 748: active count 3 != 5 replica count

VBucket 749: active count 7 != 10 replica count

VBucket 751: active count 3 != 4 replica count

VBucket 752: active count 4 != 6 replica count

VBucket 753: active count 9 != 11 replica count

VBucket 754: active count 1 != 2 replica count

VBucket 755: active count 4 != 5 replica count

VBucket 758: active count 3 != 4 replica count

VBucket 759: active count 7 != 10 replica count

VBucket 760: active count 2 != 4 replica count

VBucket 761: active count 7 != 11 replica count

VBucket 762: active count 0 != 1 replica count

VBucket 765: active count 6 != 7 replica count

VBucket 766: active count 3 != 5 replica count

VBucket 767: active count 7 != 10 replica count

VBucket 770: active count 3 != 5 replica count

VBucket 771: active count 7 != 11 replica count

VBucket 772: active count 4 != 6 replica count

VBucket 773: active count 6 != 10 replica count

VBucket 775: active count 3 != 4 replica count

VBucket 777: active count 3 != 4 replica count

VBucket 778: active count 3 != 4 replica count

VBucket 779: active count 7 != 10 replica count

VBucket 780: active count 5 != 6 replica count

VBucket 781: active count 8 != 10 replica count

VBucket 782: active count 1 != 2 replica count

VBucket 783: active count 0 != 2 replica count

VBucket 784: active count 3 != 5 replica count

VBucket 785: active count 7 != 11 replica count

VBucket 786: active count 0 != 1 replica count

VBucket 789: active count 4 != 6 replica count

VBucket 790: active count 4 != 6 replica count

VBucket 791: active count 6 != 10 replica count

VBucket 792: active count 3 != 4 replica count

VBucket 793: active count 8 != 11 replica count

VBucket 794: active count 2 != 4 replica count

VBucket 795: active count 4 != 6 replica count

VBucket 798: active count 5 != 6 replica count

VBucket 799: active count 8 != 10 replica count

VBucket 800: active count 4 != 6 replica count

VBucket 801: active count 8 != 10 replica count

VBucket 803: active count 3 != 4 replica count

VBucket 804: active count 0 != 1 replica count

VBucket 805: active count 0 != 1 replica count

VBucket 806: active count 3 != 4 replica count

VBucket 807: active count 7 != 10 replica count

VBucket 808: active count 3 != 4 replica count

VBucket 809: active count 6 != 10 replica count

VBucket 813: active count 4 != 5 replica count

VBucket 814: active count 4 != 5 replica count

VBucket 815: active count 7 != 10 replica count

VBucket 816: active count 1 != 2 replica count

VBucket 817: active count 4 != 5 replica count

VBucket 818: active count 4 != 6 replica count

VBucket 819: active count 8 != 10 replica count

VBucket 820: active count 3 != 4 replica count

VBucket 821: active count 7 != 10 replica count

VBucket 824: active count 0 != 1 replica count

VBucket 826: active count 3 != 4 replica count

VBucket 827: active count 6 != 10 replica count

VBucket 828: active count 4 != 5 replica count

VBucket 829: active count 7 != 10 replica count

VBucket 831: active count 6 != 7 replica count

VBucket 833: active count 4 != 6 replica count

VBucket 834: active count 3 != 4 replica count

VBucket 835: active count 6 != 10 replica count

VBucket 836: active count 4 != 5 replica count

VBucket 837: active count 7 != 10 replica count

VBucket 840: active count 0 != 1 replica count

VBucket 841: active count 0 != 1 replica count

VBucket 842: active count 4 != 6 replica count

VBucket 843: active count 8 != 10 replica count

VBucket 844: active count 3 != 4 replica count

VBucket 845: active count 7 != 10 replica count

VBucket 847: active count 4 != 6 replica count

VBucket 848: active count 3 != 4 replica count

VBucket 849: active count 6 != 10 replica count

VBucket 851: active count 3 != 4 replica count

VBucket 852: active count 0 != 2 replica count

VBucket 854: active count 4 != 5 replica count

VBucket 855: active count 7 != 10 replica count

VBucket 856: active count 4 != 6 replica count

VBucket 857: active count 8 != 10 replica count

VBucket 860: active count 1 != 2 replica count

VBucket 861: active count 3 != 4 replica count

VBucket 862: active count 3 != 4 replica count

VBucket 863: active count 8 != 11 replica count

VBucket 864: active count 3 != 4 replica count

VBucket 865: active count 7 != 10 replica count

VBucket 866: active count 0 != 1 replica count

VBucket 867: active count 0 != 1 replica count

VBucket 869: active count 5 != 6 replica count

VBucket 870: active count 5 != 6 replica count

VBucket 871: active count 8 != 10 replica count

VBucket 872: active count 3 != 5 replica count

VBucket 873: active count 7 != 11 replica count

VBucket 875: active count 5 != 6 replica count

VBucket 878: active count 4 != 6 replica count

VBucket 879: active count 6 != 10 replica count

VBucket 882: active count 3 != 4 replica count

VBucket 883: active count 7 != 10 replica count

VBucket 884: active count 5 != 6 replica count

VBucket 885: active count 9 != 11 replica count

VBucket 886: active count 1 != 2 replica count

VBucket 887: active count 3 != 4 replica count

VBucket 889: active count 3 != 4 replica count

VBucket 890: active count 3 != 5 replica count

VBucket 891: active count 7 != 11 replica count

VBucket 892: active count 4 != 6 replica count

VBucket 893: active count 6 != 10 replica count

VBucket 894: active count 0 != 1 replica count

VBucket 896: active count 8 != 10 replica count

VBucket 897: active count 4 != 6 replica count

VBucket 900: active count 2 != 3 replica count

VBucket 901: active count 2 != 3 replica count

VBucket 902: active count 7 != 10 replica count

VBucket 903: active count 3 != 4 replica count

VBucket 904: active count 7 != 11 replica count

VBucket 905: active count 2 != 4 replica count

VBucket 906: active count 4 != 5 replica count

VBucket 909: active count 0 != 2 replica count

VBucket 910: active count 7 != 10 replica count

VBucket 911: active count 3 != 5 replica count

VBucket 912: active count 0 != 1 replica count

VBucket 914: active count 8 != 10 replica count

VBucket 915: active count 4 != 6 replica count

VBucket 916: active count 7 != 10 replica count

VBucket 917: active count 3 != 4 replica count

VBucket 918: active count 4 != 6 replica count

VBucket 920: active count 5 != 7 replica count

VBucket 922: active count 7 != 11 replica count

VBucket 923: active count 2 != 4 replica count

VBucket 924: active count 7 != 10 replica count

VBucket 925: active count 3 != 5 replica count

VBucket 928: active count 4 != 5 replica count

VBucket 930: active count 8 != 12 replica count

VBucket 931: active count 3 != 5 replica count

VBucket 932: active count 7 != 11 replica count

VBucket 933: active count 3 != 6 replica count

VBucket 935: active count 0 != 1 replica count

VBucket 938: active count 7 != 10 replica count

VBucket 939: active count 3 != 4 replica count

VBucket 940: active count 8 != 10 replica count

VBucket 941: active count 5 != 6 replica count

VBucket 942: active count 2 != 3 replica count

VBucket 943: active count 1 != 2 replica count

VBucket 944: active count 8 != 12 replica count

VBucket 945: active count 3 != 5 replica count

VBucket 946: active count 6 != 7 replica count

VBucket 950: active count 7 != 11 replica count

VBucket 951: active count 3 != 6 replica count

VBucket 952: active count 7 != 10 replica count

VBucket 953: active count 3 != 4 replica count

VBucket 954: active count 0 != 1 replica count

VBucket 956: active count 5 != 6 replica count

VBucket 958: active count 8 != 10 replica count

VBucket 959: active count 5 != 6 replica count

VBucket 960: active count 7 != 10 replica count

VBucket 961: active count 3 != 4 replica count

VBucket 962: active count 3 != 5 replica count

VBucket 963: active count 2 != 4 replica count

VBucket 966: active count 8 != 10 replica count

VBucket 967: active count 5 != 6 replica count

VBucket 968: active count 8 != 12 replica count

VBucket 969: active count 3 != 5 replica count

VBucket 971: active count 0 != 1 replica count

VBucket 972: active count 5 != 7 replica count

VBucket 974: active count 7 != 11 replica count

VBucket 975: active count 3 != 6 replica count

VBucket 976: active count 3 != 4 replica count

VBucket 978: active count 7 != 10 replica count

VBucket 979: active count 3 != 4 replica count

VBucket 980: active count 8 != 10 replica count

VBucket 981: active count 5 != 6 replica count

VBucket 982: active count 0 != 2 replica count

VBucket 983: active count 1 != 2 replica count

VBucket 986: active count 8 != 12 replica count

VBucket 987: active count 3 != 5 replica count

VBucket 988: active count 7 != 11 replica count

VBucket 989: active count 3 != 6 replica count

VBucket 990: active count 4 != 5 replica count

VBucket 993: active count 0 != 1 replica count

VBucket 994: active count 7 != 11 replica count

VBucket 995: active count 2 != 4 replica count

VBucket 996: active count 7 != 10 replica count

VBucket 997: active count 3 != 5 replica count

VBucket 998: active count 5 != 6 replica count

VBucket 1000: active count 4 != 5 replica count

VBucket 1001: active count 1 != 2 replica count

VBucket 1002: active count 9 != 11 replica count

VBucket 1003: active count 4 != 6 replica count

VBucket 1004: active count 7 != 10 replica count

VBucket 1005: active count 3 != 4 replica count

VBucket 1008: active count 7 != 11 replica count

VBucket 1009: active count 2 != 4 replica count

VBucket 1012: active count 4 != 5 replica count

VBucket 1014: active count 7 != 10 replica count

VBucket 1015: active count 3 != 5 replica count

VBucket 1016: active count 8 != 10 replica count

VBucket 1017: active count 4 != 6 replica count

VBucket 1018: active count 3 != 4 replica count

VBucket 1020: active count 0 != 1 replica count

VBucket 1022: active count 7 != 10 replica count

VBucket 1023: active count 3 != 4 replica count

Active item count = 3500

Same at source
----------------------
Arunas-MacBook-Pro:bin apiravi$ ./cbvdiff 172.23.106.45:11210,172.23.106.46:11210
Active item count = 3500

Will attach cbcollect and data files.
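As an aside, the cbvdiff output above can be summarized with a small parser; a sketch, assuming the output was saved to cbvdiff.txt (hypothetical file name):

import re

pattern = re.compile(r"VBucket (\d+): active count (\d+) != (\d+) replica count")

mismatched = 0
extra_replica_items = 0
with open("cbvdiff.txt") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            vb, active, replica = map(int, m.groups())
            mismatched += 1
            extra_replica_items += replica - active

print("vbuckets with mismatches:", mismatched)
print("replica items not yet deleted:", extra_replica_items)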


 Comments   
Comment by Mike Wiederhold [ 15/Sep/14 ]
This is not a bug. We no longer do this because a replica vbucket cannot delete items on its own due to DCP.
Comment by Aruna Piravi [ 15/Sep/14 ]
I do not understand why this is not a bug. This is a case where replica items = 4250 and active = 3500. Both were initially 5000 before warmup. However, 50% of the actual deletes have happened on the replica bucket (5000 -> 4250), so I would expect the other 750 items to be deleted too, so that active = replica. If this is not a bug, then in case of failover the cluster will end up having more items than it did before the failover.
Comment by Aruna Piravi [ 15/Sep/14 ]
> We no longer do this because a replica vbucket cannot delete items on its own due to DCP
Then I would expect the deletes to be propagated from the active vbuckets through DCP, but these never get propagated. If you run cbvdiff even now, you can see the mismatch.
Comment by Sriram Ganesan [ 17/Sep/14 ]
Aruna

If there is a testrunner script available for steps (1) - (5), please update the bug. Thanks.
Comment by Aruna Piravi [ 17/Sep/14 ]
Done.
Comment by Aruna Piravi [ 19/Sep/14 ]
On 3 runs on 3.0.1-1309 in the same environment where I was able to reproduce this consistently up to build 1307, I do not see this mismatch. I'm not sure if any recent check-in helped. It seems to be a tricky case that is visible in some builds but not others. In any case I request that we look at the logs from the cases where we did reproduce this problem to ascertain the cause. Thanks.
Comment by Aruna Piravi [ 22/Sep/14 ]
Not seeing this in the most recent build, 3.0.1-1313, either. Reducing severity. Will resolve once the cause is known.




[MB-7761] Add stats for all operations to memcached Created: 15/Feb/13  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.1, 2.0, 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Improvement Priority: Major
Reporter: Mike Wiederhold Assignee: Trond Norbye
Resolution: Unresolved Votes: 0
Labels: supportability
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Attachments: File stats-improvements.md    
Issue Links:
Duplicate
is duplicated by MB-11986 Stats for every operations. (prepend ... Resolved
Relates to
relates to MB-8793 Prepare spec on stats updates Open
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-5011 gat (get and touch) operation not rep... Technical task Open Mike Wiederhold  
MB-6121 More operation stats please Technical task Open Mike Wiederhold  
MB-7419 Disk reads for append/prepend/incr/de... Technical task Open Mike Wiederhold  
MB-7711 UI: Getandlock doesn't show up in any... Technical task Closed Mike Wiederhold  
MB-7807 aggregate all kinds of ops in ops/sec... Technical task Open Anil Kumar  
MB-8183 getAndTouch (and touch) operations ar... Technical task Resolved Aleksey Kondratenko  
MB-10377 getl and cas not reported in the GUI ... Technical task Open Aleksey Kondratenko  
MB-11655 Stats: Getandlock doesn't show up in ... Technical task Open Mike Wiederhold  

 Description   
Stats have increasingly been an issue to deal with since they are half done in memcached and half done in ep-engine. Memcached should simply handle connections and not really care about or track anything operation-related. That should happen in the engines, and memcached should just ask for the info when it needs it.

 Comments   
Comment by Tug Grall (Inactive) [ 01/May/13 ]
Just to be sure they are linked. I'll let the engineering team choose how to deal with this JIRA.
Comment by Perry Krug [ 07/Jul/14 ]
Raising awareness of this broad supportability issue, which sometimes makes it hard for the field and customers to accurately understand their Couchbase traffic.
Comment by Mike Wiederhold [ 03/Sep/14 ]
Trond,

I've attached the design document for this issue. Last time we discussed this, you mentioned that you would take on the task of implementing it in memcached. Once you're finished I will coordinate the rest of the ns_server/ep-engine changes.
Comment by Perry Krug [ 20/Sep/14 ]
Mike, just came across another situation I hope we can resolve with this.

It seems that in the current implementation, CAS operations (even on set) are not included in the cmd_set statistic. So when looking at the UI, even with a very high load of set+CAS, nothing is recorded in the graphs.

Given that with the binary protocol a CAS can be supplied on any operation, can we augment the way we track statistics so that an operation is counted under its underlying operation whether or not it has a CAS? And then also have the break-out "CAS operations/hits/misses" statistics count all operations that have a CAS supplied?

Thanks
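A toy sketch of the counting scheme Perry describes (the stat names here are illustrative, not the actual memcached stat keys):

from collections import Counter

stats = Counter()

def record_op(op, cas_supplied, hit):
    # Count the request under its underlying command regardless of CAS...
    stats["cmd_" + op] += 1
    # ...and additionally under the break-out CAS statistics when a CAS
    # value was supplied with the request.
    if cas_supplied:
        stats["cas_ops"] += 1
        stats["cas_hits" if hit else "cas_misses"] += 1

record_op("set", cas_supplied=True, hit=True)   # set-with-CAS still counts as cmd_set
record_op("set", cas_supplied=False, hit=True)
print(dict(stats))  # {'cmd_set': 2, 'cas_ops': 1, 'cas_hits': 1}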




[MB-12222] Duplicate existing cluster management ui using angularjs Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: sherlock
Fix Version/s: techdebt-backlog, sherlock
Security Level: Public

Type: Story Priority: Major
Reporter: Aleksey Kondratenko Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We're having difficulties maintaining the current cells.js-based UI. Therefore, to make our code base more accessible to a wider audience of JS developers, we're rewriting our UI's JavaScript using the widely adopted AngularJS.




[MB-8668] Separate CPU hungry pieces (xdcr and/or views) out of ns_server erlang VM Created: 19/Jul/13  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: 3.0.1, sherlock
Security Level: Public

Type: Task Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Artem Stemkovski
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
XDCR can eat tons of resources, sometimes hogging cluster-management resources. Views can spend lots of CPU and IO too.

Plus we know that views are significantly faster if we don't run Erlang with async IO threads. And we currently have to, because otherwise cluster management will not receive its CPU cycles in a timely manner.


 Comments   
Comment by Dipti Borkar [ 21/Jul/13 ]
Alk, any reason why this is marked for 2.2?
Comment by Aleksey Kondratenko [ 23/Jul/13 ]
Dipti, I was under the impression that we agreed to try to do this as part of 2.2.0. It would be awesome if we could (some people are doing weird things and causing weird timeouts in the management layer because of the way things are configured), but given time constraints this might be impossible.
Comment by Aleksey Kondratenko [ 25/Jul/13 ]
As per Dipti, this is lowest priority.




[MB-11999] Resident ratio of active items drops from 3% to 0.06% during rebalance with delta recovery Created: 18/Aug/14  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1169

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File vb_active_resident_items_ratio.png     PNG File vb_replica_resident_items_ratio.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares-dev/45/artifact/
Is this a Regression?: No

 Description   
1 of 4 nodes is being re-added after failover.
500M x 2KB items, 10K mixed ops/sec.

Steps:
1. Fail over one of the nodes.
2. Add it back.
3. Enable delta recovery.
4. Sleep 20 minutes.
5. Rebalance the cluster.

Most importantly, this happens due to excessive memory usage.

 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
http://review.couchbase.org/#/c/41468/
Comment by Abhinav Dangeti [ 18/Sep/14 ]
Merged fix.




[MB-12197] Bucket deletion failing with error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue."} Created: 16/Sep/14  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Meenakshi Goel Assignee: Meenakshi Goel
Resolution: Fixed Votes: 0
Labels: windows, windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1299-rel

Attachments: Text File test.txt    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.hq.northscale.net/job/win_2008_x64--14_01--replica_read-P0/32/consoleFull
http://qa.hq.northscale.net/job/win_2008_x64--59--01--bucket_flush-P1/14/console
http://qa.hq.northscale.net/job/win_2008_x64--59_01--warmup-P1/6/consoleFull

Test to Reproduce:
newmemcapable.GetrTests.getr_test,nodes_init=4,GROUP=P0,expiration=60,wait_expiration=true,error=Not found for vbucket,descr=#simple getr replica_count=1 expiration=60 flags = 0 docs_ops=create cluster ops = None
flush.bucketflush.BucketFlushTests.bucketflush,items=20000,nodes_in=3,GROUP=P0

*Note that the test itself doesn't fail, but subsequent tests fail with "error 400 reason: unknown ["Prepare join failed. Node is already part of cluster."]" because cleanup wasn't successful.

Logs:
[rebalance:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.6938.0>:ns_rebalancer:do_wait_buckets_shutdown:307]Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.3.121.182',
                                                         {'EXIT',
                                                          {old_buckets_shutdown_wait_failed,
                                                           ["default"]}}}]

[error_logger:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.6938.0>
    registered_name: []
    exception exit: {buckets_shutdown_wait_failed,
                        [{'ns_1@10.3.121.182',
                             {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                     ["default"]}}}]}
      in function ns_rebalancer:do_wait_buckets_shutdown/1 (src/ns_rebalancer.erl, line 308)
      in call from ns_rebalancer:rebalance/5 (src/ns_rebalancer.erl, line 361)
    ancestors: [<0.811.0>,mb_master_sup,mb_master,ns_server_sup,
                  ns_server_cluster_sup,<0.57.0>]
    messages: []
    links: [<0.811.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 46422
    stack_size: 27
    reductions: 5472
  neighbours:

[user:info,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.811.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {buckets_shutdown_wait_failed,
                              [{'ns_1@10.3.121.182',
                                {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                  ["default"]}}}]}
[ns_server:error,2014-09-15T9:36:09.645,ns_1@10.3.121.182:ns_memcached-default<0.4908.0>:ns_memcached:terminate:798]Failed to delete bucket "default": {error,{badmatch,{error,closed}}}

Uploading Logs

 Comments   
Comment by Meenakshi Goel [ 16/Sep/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12197/11dd43ca/10.3.121.182-9152014-938-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/e7795065/10.3.121.183-9152014-940-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/6442301b/10.3.121.102-9152014-942-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/10edf209/10.3.121.107-9152014-943-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/9f16f503/10.1.2.66-9152014-945-diag.zip
Comment by Ketaki Gangal [ 16/Sep/14 ]
Assigning to ns_server team for a first look.
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
For cases like this it's very useful to get a sample of backtraces from memcached on the bad node. Is it still running?
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
Eh. It's windows....
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
I've merged diagnostics commit (http://review.couchbase.org/41463). Please rerun, reproduce and give me new set of logs.
Comment by Meenakshi Goel [ 18/Sep/14 ]
Tested with 3.0.1-1307-rel, Please find logs below.
https://s3.amazonaws.com/bugdb/jira/MB-12197/c2191900/10.3.121.182-9172014-2245-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/28bc4a83/10.3.121.183-9172014-2246-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/8f1efbe5/10.3.121.102-9172014-2248-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/91a89d6a/10.3.121.107-9172014-2249-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/2d272074/10.1.2.66-9172014-2251-diag.zip
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
BTW I am indeed quite interested if this is specific to windows or not.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
This continues to be super weird. Possibly another Erlang bug. I need somebody to answer the following:

* Can we reliably reproduce this on Windows?

* 100% of the time?

* If not, (roughly) how often?

* Can we reproduce this (at all) on GNU/Linux? How frequently?
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
No need to diagnose it any further. Thanks to Aliaksey, we managed to understand this case, and the fix is going to be merged shortly.
Comment by Venu Uppalapati [ 18/Sep/14 ]
Here is my empirical observation for this issue:
1) I have the following inside a .bat script:

C:\"Program Files"\Couchbase\Server\bin\couchbase-cli.exe bucket-delete -c 127.0.0.1:8091 --bucket=default -u Administrator -p password

C:\"Program Files"\Couchbase\Server\bin\couchbase-cli.exe rebalance -c 127.0.0.1:8091 --server-remove=172.23.106.180 -u Administrator -p password

2) I execute this script against a two-node cluster with the default bucket created, but with no data.

3) I see bucket deletion and rebalance fail in succession. This happened in 4 out of 4 trials.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
http://review.couchbase.org/41474
Comment by Meenakshi Goel [ 19/Sep/14 ]
Tested with 3.0.1-1309-rel and no longer seeing the issue.
http://qa.hq.northscale.net/job/win_2008_x64--14_01--replica_read-P0/34/console




[MB-11454] cbstats should have -j option to output results in JSON format Created: 17/Jun/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.2.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Kirk Kirkconnell Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It'd be nice to have the ability to output cbstats results in JSON format, for example by adding a -j argument to the command.
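Until such a flag exists, callers can approximate it by converting the current "key: value" output themselves; a rough sketch of what -j would effectively produce (the path, host, port and bucket are examples):

import json
import subprocess

# Run cbstats and turn its "key: value" lines into a JSON object.
out = subprocess.check_output(
    ["/opt/couchbase/bin/cbstats", "localhost:11210", "all", "-b", "default"])

stats = {}
for line in out.decode().splitlines():
    key, sep, value = line.partition(":")
    if sep:
        stats[key.strip()] = value.strip()

print(json.dumps(stats, indent=2))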




[MB-11909] Couchbase user has a login shell Created: 08/Aug/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Patrick Varley Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Good security practice is not to give the user of a service a login shell.

patrick@mancouch:~$ getent passwd couchbase
couchbase:x:999:999:Couchbase system user:/opt/couchbase:/bin/sh

We should set the login shell to /bin/false or /bin/nologin.

 Comments   
Comment by Bin Cui [ 22/Sep/14 ]
http://review.couchbase.org/#/c/41570/




[MB-12221] N1QL should return version information Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Major
Reporter: Cihan Biyikoglu Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
N1QL should have a version() function that returns version information. This will be useful for cases where behavioral changes are implemented and apps want to issue queries tuned to specific N1QL versions, e.g.:

if n1ql_version()=1.0 query='...' else if n1ql_version()=2.0 query='+++'
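From the application side the intent might look like the following sketch (pick_query and the version strings are illustrative only, mirroring the placeholders above):

def pick_query(n1ql_version):
    # Choose a statement tuned to the N1QL version reported by the server.
    if n1ql_version.startswith("1."):
        return "..."   # query written against 1.0 behaviour
    return "+++"       # query written against 2.0 behaviour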






[MB-12220] Add unique id generation functions to n1ql Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Major
Reporter: Cihan Biyikoglu Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Add a unique ID generation function to N1QL, e.g. new_uuid().
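Functionally this would match what clients generate today with a UUID library; for example (Python shown only to illustrate the expected shape of the value):

import uuid

# What a N1QL new_uuid() call would conceptually return: a random,
# collision-resistant identifier usable as a document key.
print(str(uuid.uuid4()))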




[MB-12150] [Windows] Cleanup unnecessary files that are part of the windows installer Created: 08/Sep/14  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build 3.0.1-1261

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install Windows build 3.0.1-1261.
As part of the installation you will see 2 files, couchbase_console.html and membase_console.html. membase_console.html is not needed; please remove it.

 Comments   
Comment by Bin Cui [ 22/Sep/14 ]
http://review.couchbase.org/#/c/41567/




[MB-12217] Wrong parameter order in xdcr debug message Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Chris Malarky Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The following message:

[xdcr:debug,2014-09-19T12:29:31.255,ns_1@ec2-xxx-xxx-xxx.compute-1.amazonaws.com:<0.25112.24>:concurrency_throttle:handle_call:88]no token available (total tokens:<0.25337.24>), put (pid:32, signal: start_replication, targetnode: "ec2-yyy-yyy-yyy-yyy.us-west-2.compute. amazonaws.com:8092") into waiting pool (active reps: 32, waiting reps: 305)

is generated by:

http://src.couchbase.org/source/xref/2.5.1/ns_server/src/concurrency_throttle.erl#88

The parameters Pid and TotalTokens on line 90 need to be swapped around.




[MB-11853] OS X app doesn't work if run from non-administrator user account Created: 30/Jul/14  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Jens Alfke Assignee: Bin Cui
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
A user on the mobile mailing list reports that the Mac OS Couchbase Server app doesn't work when it's run from a non-administrator OS user account:

"It doesn't start from my non-admin account - message is displayed "You can't open the application "%@" because it may be damaged or incomplete." Yet Couchbase starts successfully from if I start it from the command line using an administrator account. In addition not all the menu options under the Couchbase icon work, e.g. Open Admin Console or About Couchbase Server."

https://groups.google.com/d/msgid/mobile-couchbase/5f53905e-bad1-4e51-b146-9d99f774506b%40googlegroups.com?utm_medium=email&utm_source=footer

It sounds as though some of the files in the bundle may have admin-only permissions?

 Comments   
Comment by Anil Kumar [ 30/Jul/14 ]
If a user wants to install Couchbase on Mac OS X as non-admin, they can follow the steps in our documentation for non-root/non-sudo users: http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#installing-on-mac-os-x-as-non-root-non-sudo
Comment by Jens Alfke [ 30/Jul/14 ]
If you have to go through a kludge like that to run it, that's a bug. Re-opening.

There's no reason this should require being an admin user. It doesn't copy anything into system directories.
Comment by Wayne Siu [ 01/Aug/14 ]
Per PM (Anil), it will not be included in the 3.0 release. Will review the priority at a later time.
Comment by Bin Cui [ 22/Sep/14 ]
I can install and run Couchbase Server from any account other than root without any problem. Cannot reproduce it in any case.




[MB-8014] Improve performance for cbdocloader Created: 06/Dec/12  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.1.0
Fix Version/s: bug-backlog, 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: PM-PRIORITIZED
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-11441 Loading the example buckets is extrem... Resolved

 Description   
Enhance cbdocloader in terms of performance, rather than replacing it with cbimport. Use cbtransfer at the back to improve performance.

JSON, CSV and BSON support was mentioned by Dipti [we will have another bug for that].

Also, handling invocation by ns_server during the setup wizard process needs a thought-through story.

And, finally, keeping in mind a future cbexport tool is something to handle, too.

 Comments   
Comment by Perry Krug [ 17/Jun/14 ]
Would this automatically extend to cbworkloadgen or does that need to be addressed separately?
Comment by Trond Norbye [ 21/Jun/14 ]
Perry: I prototyped a version in Go (just to finally have a reason to play around with Go ;-) which you'll find at: https://github.com/trondn/go-cbdocloader

If you just copy that to "bin/tools/cbdocloader" you should get way better performance for your loading ;) (and it should work with changes to the cluster topology during loading of the sample data)
Comment by Steve Yen [ 30/Jun/14 ]
see also bugs MB-11441 and MB-10714
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Bin, Ashvinder, Wayne .. July 17th
Comment by Bin Cui [ 22/Sep/14 ]
cbdocloader was reimplemented with cbtransfer to improve throughput.




[MB-12218] DGM cluster saw "out of memory" errors from couchstore on vbucket snapshot path Created: 22/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Jim Walker Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: error-handling, memory
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Linux 3.2.0-68-virtual
[info] OS Version : Ubuntu 12.04.5 LTS
[info] CB Version : 3.0.0-1209-rel-enterprise

[info] Architecture : x86_64
[info] Virtual Host : Microsoft HyperV
[ok] Installed CPUs : 4
[ok] Installed RAM : 28140 MB
[ok] Used RAM : 69.9% (19658 / 28139 MB)

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: Some memcached.log files from cbase-43

http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.5.txt
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.4.txt
Is this a Regression?: Unknown

 Description   
Raising this defect after looking at a large DGM cluster that had a stalled rebalance. It looks like some failures in couchstore (memory issues) led to memcached termination and a stall of the rebalance, whereas the error could perhaps have been handled and ejection performed.

The cluster is a 4-node "large" scale cluster hosted in Azure. Cihan provided me access via a private key, which I would rather people request from Cihan than me spreading the key around :) At the moment the cluster is stuck, and there is historical logging data on a number of nodes indicating that memory errors were caught but led to termination and, I suspect, the stall.

The tail end of the following file shows that memory problems were detected and logged:
 
* http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.4.txt

Starting at 10:31 we see the following pattern.

Sat Sep 13 10:31:31.375401 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name=/data/couchbase/b1_full_ejection/1020.couch.1 option=1 rev=1 error=failed to allocate buffer [errno = 12: 'Cannot allocate memory']
Sat Sep 13 10:31:31.375461 UTC 3: (b1_full_ejection) Warning: failed to open database, name=/data/couchbase/b1_full_ejection/1020.couch.1020
Sat Sep 13 10:31:31.375474 UTC 3: (b1_full_ejection) Warning: failed to set new state, active, for vbucket 1020
Sat Sep 13 10:31:31.375398 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name= option=1 rev=1 error=failed to allocate buffer []
Sat Sep 13 10:31:31.375481 UTC 3: (b1_full_ejection) VBucket snapshot task failed!!! Rescheduling

And finally the file ends with:

Sat Sep 13 10:31:31.577731 UTC 3: (b1_full_ejection) nonio_worker_9: Exception caught in task "Checkpoint Remover on vb 189": std::bad_alloc

The next version of memcached.log is the following file, which indicates that memcached was restarted:

* http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.5.txt

Sat Sep 13 10:32:29.783313 UTC 3: (b1_full_ejection) Trying to connect to mccouch: "127.0.0.1:11213"
Sat Sep 13 10:32:29.787504 UTC 3: (b1_full_ejection) Connected to mccouch: "127.0.0.1:11213"
Sat Sep 13 10:32:29.797130 UTC 3: (No Engine) Bucket b1_full_ejection registered with low priority
Sat Sep 13 10:32:29.797244 UTC 3: (No Engine) Spawning 4 readers, 4 writers, 1 auxIO, 1 nonIO threads
Sat Sep 13 10:32:30.100791 UTC 3: (b1_full_ejection) metadata loaded in 301 ms

cbcollect logs from 3 of 4 nodes (/tmp is tiny on node 41), which may be useful but don't have the historical data from the live node shown above:

http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-43.zip
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-42.zip
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-40.zip

 Comments   
Comment by Jim Walker [ 22/Sep/14 ]
I'll take this unless there's an obvious dup or something already in the pipeline.




[MB-11917] One node slow probably due to the Erlang scheduler Created: 09/Aug/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Volker Mische Assignee: Harsha Havanur
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File crash_toy_701.rtf     PNG File leto_ssd_300-1105_561_build_init_indexleto_ssd_300-1105_561172.23.100.31beam.smp_cpu.png    
Issue Links:
Duplicate
duplicates MB-12200 Seg fault during indexing on view-toy... Resolved
duplicates MB-9822 One of nodes is too slow during indexing Closed
is duplicated by MB-12183 View Query Thruput regression compare... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
One node is slow; that's probably due to the "scheduler collapse" bug in the Erlang VM R16.

I will try to find a way to verify that it is really the scheduler and not some other problem. This is basically a duplicate of MB-9822, but that bug has a long history, hence I dared to create a new one.

 Comments   
Comment by Volker Mische [ 09/Aug/14 ]
I forgot to add that our issue sounds exactly like that one: http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
Comment by Sriram Melkote [ 11/Aug/14 ]
Upgrading to blocker as this is doubling initial index time in recent runs on showfast.
Comment by Volker Mische [ 12/Aug/14 ]
I verified that it's the "scheduler collapse". Have a look at the chart I've attached (it's from [1] [172.23.100.31] beam.smp_cpu). It starts with a utilization of around 400%. At around 120 I reduced the online schedulers to 1 (by running erlang:system_flag(schedulers_online, 1) via a remote shell). I then increased schedulers_online again at around 150, back to the original value of 24. You can see that it got back to normal.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1105_561_build_init_index
Comment by Volker Mische [ 12/Aug/14 ]
I would try to run on R16 and see how often it happens with COUCHBASE_NS_SERVER_VM_EXTRA_ARGS=["+swt", "low", "+sfwi", "100"] set (as suggested in MB-9822 [1]).

[1]: https://www.couchbase.com/issues/browse/MB-9822?focusedCommentId=89219&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-89219
Comment by Pavel Paulau [ 12/Aug/14 ]
We agreed to try:

+sfwi 100/500 and +sbwt long

Will run test 5 times with these options.
Comment by Pavel Paulau [ 13/Aug/14 ]
5 runs of tests/index_50M_dgm.test with -sfwi 100 -sbwt long:

http://ci.sc.couchbase.com/job/leto-dev/19/
http://ci.sc.couchbase.com/job/leto-dev/20/
http://ci.sc.couchbase.com/job/leto-dev/21/
http://ci.sc.couchbase.com/job/leto-dev/22/
http://ci.sc.couchbase.com/job/leto-dev/23/

3 normal runs, 2 with slowness.
Comment by Volker Mische [ 13/Aug/14 ]
I see only one slow run (22): http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_6a0_build_init_index

But still :-/
Comment by Pavel Paulau [ 13/Aug/14 ]
See (20), incremental indexing: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1137_ed9_build_incr_index
Comment by Volker Mische [ 13/Aug/14 ]
Oh, I was only looking at the initial building.
Comment by Volker Mische [ 13/Aug/14 ]
I got a hint in the #erlang IRC channel. I'll try to use the erlang:bump_reductions(2000) and see if that helps.
Comment by Volker Mische [ 13/Aug/14 ]
Let's see if bumping the reductions makes it work: http://review.couchbase.org/40591
Comment by Aleksey Kondratenko [ 13/Aug/14 ]
merged that commit.
Comment by Pavel Paulau [ 13/Aug/14 ]
Just tested build 3.0.0-1150, rebalance test but with initial indexing phase.

2 nodes are super slow and utilize only single core.
Comment by Volker Mische [ 18/Aug/14 ]
I can't reproduce it locally. I tend towards closing this issue as "won't fix". We should really not have long-running NIFs.

I also think that it won't happen much under real workloads. And even if it does, the workaround would be to reduce the number of online schedulers to 1 and then immediately increase it back to the original number.
Comment by Volker Mische [ 18/Aug/14 ]
Assigning to Siri to make the call on whether we close it or not.
Comment by Anil Kumar [ 18/Aug/14 ]
Triage - Not blocking 3.0 RC1
Comment by Raju Suravarjjala [ 19/Aug/14 ]
Triage: Siri will put additional information and this bug is being retargeted to 3.0.1
Comment by Sriram Melkote [ 19/Aug/14 ]
Folks, for too long we've had trouble that gets pinned to our NIFs. In 3.5, let's solve it with whatever is the correct Erlang approach to running heavy, high-performance code. A port, reporting reductions, moving to R17 with dirty schedulers, or some other option I missed - whatever the best solution is, let us implement it in 3.5 and be done.
Comment by Volker Mische [ 09/Sep/14 ]
I think we should close this issue and rather create a new one for whatever we come up with (e.g. the async mapreduce NIF).
Comment by Harsha Havanur [ 10/Sep/14 ]
Toy Build for this change at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-702-toy.deb

Review in progress at
http://review.couchbase.org/#/c/41221/4
Comment by Harsha Havanur [ 12/Sep/14 ]
Please find the updated toy build for this:
http://latestbuilds.hq.couchbase.com/couchbase-server-community_ubunt12-3.0.0-toy-hhs-x86_64_3.0.0-704-toy.deb
Comment by Sriram Melkote [ 12/Sep/14 ]
Another occurrence of this, MB-12183.

I'm making this a blocker.
Comment by Harsha Havanur [ 13/Sep/14 ]
Centos build at
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-700-toy.rpm
Comment by Ketaki Gangal [ 16/Sep/14 ]
Filed bug MB-12200 for this toy-build
Comment by Ketaki Gangal [ 17/Sep/14 ]
Attaching stack from toy-build 701: crash_toy_701.rtf

Access to the machine is as mentioned previously on MB-12200.
Comment by Harsha Havanur [ 19/Sep/14 ]
We are facing 2 issues with the async NIF implementation.
1) Loss of signals leading to deadlock in the queue enqueue and dequeue paths.
I suspect the enif mutex and condition variables. I could reproduce the deadlock scenario on CentOS, which potentially points to both the producer and consumer (enqueue and dequeue) in our case going to sleep due to not handling condition variable signals correctly.
To address this issue, I have replaced the enif mutex and condition variables with their C++ STL counterparts. This seems to fix the deadlock situation.

2) Memory getting freed by the terminator task while the context is still alive during mapDoc.
This is still work in progress and I will update once I have a solution for it.
Comment by Harsha Havanur [ 21/Sep/14 ]
The segmentation fault is probably due to the termination of the Erlang process calling map_doc. This triggers the destructor, which cleans up the v8 context while the task is still in the queue. Will attempt a fix for this.
Comment by Harsha Havanur [ 22/Sep/14 ]
I have fixed both issues in this build
http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-hhs-x86_64_3.0.0-709-toy.rpm
I am running systests as Ketaki suggested on VMs 10.6.2.164, 165, 168, 171, 172, 194, 195. Currently rebalance is in progress.

For the deadlock, the resolution was to broadcast the condition signal to wake up all waiting threads instead of waking only one of them.
For the segmentation fault, the resolution was to complete the map task for the context before it is cleaned up by the destructor when the Erlang process calling the map task terminates or crashes.

Please use this build for further functional and performance verification. Thanks,
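As an illustration of the broadcast fix described above, here is a minimal sketch using plain C++ standard-library primitives; the BoundedTaskQueue name and layout are illustrative, not the actual NIF/enif code. With one condition variable shared by producers and consumers, notify_one() can wake the wrong side and leave the intended waiter asleep forever, whereas notify_all() wakes everyone and lets each waiter's predicate decide.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename T>
class BoundedTaskQueue {
public:
    explicit BoundedTaskQueue(std::size_t capacity) : capacity_(capacity) {}

    void enqueue(T task) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(std::move(task));
        // Broadcast: with a single condition variable, notify_one() might wake
        // another producer instead of a consumer, stranding the consumer.
        cond_.notify_all();
    }

    T dequeue() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !items_.empty(); });
        T task = std::move(items_.front());
        items_.pop_front();
        cond_.notify_all();
        return task;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<T> items_;
};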




[MB-9045] [windows] cbworkloadgen hangs Created: 03/Sep/13  Updated: 22/Sep/14  Resolved: 22/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.2.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Iryna Mironava Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: scrubbed, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.2.0-817
<manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="862733cea3805cf8eba957a120a67986cd57e4e3"><copyfile dest="Makefile" src="Makefile.top"/></project><project name="bucket_engine" path="bucket_engine" revision="2a797a8d97f421587cce728f2e6aa2cd42c8fa26"/><project name="ep-engine" path="ep-engine" revision="864296f0b4068f9d8e3943fbea6e34c29cf0e903"/><project name="libconflate" path="libconflate" revision="c0d3e26a51f25a2b020713559cb344d43ce0b06c"/><project name="libmemcached" path="libmemcached" revision="ea579a523ca3af872c292b1e33d800e3649a8892" remote="membase"/><project name="libvbucket" path="libvbucket" revision="408057ec55da3862ab8d75b1ed25d2848afd640f"/><project name="couchbase-cli" path="couchbase-cli" revision="94b37190ece87b4386a93b64e62487370d268654" remote="couchbase"/><project name="memcached" path="memcached" revision="414d788f476a019cc5d2b05e0ce72504fe469c79" remote="membase"/><project name="moxi" path="moxi" revision="01bd2a5c0aff2ca35611ba3fb857198945cc84eb"/><project name="ns_server" path="ns_server" revision="8e533a59413ba98dd8a0bc31b409668ca886c560"/><project name="portsigar" path="portsigar" revision="2204847c85a3ccaecb2bb300306baf64824b2597"/><project name="sigar" path="sigar" revision="a402af5b6a30ea8e5e7220818208e2601cb6caba"/><project name="couchbase-examples" path="couchbase-examples" revision="cd9c8600589a1996c1ba6dbea9ac171b937d3379"/><project name="couchbase-python-client" path="couchbase-python-client" revision="f14c0f53b633b5313eca1ef64b0f241330cf02c4"/><project name="couchdb" path="couchdb" revision="386be73085c0b2a8e11cd771fc2ce367b62b7354"/><project name="couchdbx-app" path="couchdbx-app" revision="300031ab2e7e2fc20c59854cb065a7641e8654be"/><project name="couchstore" path="couchstore" revision="30f8f0872ef28f95765a7cad4b2e45e32b95dff8"/><project name="geocouch" path="geocouch" revision="000096996e57b2193ea8dde87e078e653a7d7b80"/><project name="healthchecker" path="healthchecker" revision="fd4658a69eec1dbe8a6122e71d2624c5ef56919c"/><project name="testrunner" path="testrunner" revision="8371aa1cc3a21650b3a9f81ba422ec9ac3151cfc"/><project name="cbsasl" path="cbsasl" revision="6ba4c36480e78569524fc38f6befeefb614951e6"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="44a584d1de8c89addfb4f1d0522bdbbbed83ba48" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest>

Attachments: Zip Archive cbcollect.zip    
Triage: Untriaged
Operating System: Windows 64-bit

 Description   
/cygdrive/c/Program\ Files/Couchbase/Server/bin/cbworkloadgen.exe -n localhost:8091 -r 0.9 -i 1000 -b default -s 256 -j -t 2 -u Administrator -p password
It loaded only 369 items and then just hung.

 Comments   
Comment by Bin Cui [ 03/Sep/13 ]
Looks like the parameter -s 256 causes the trouble; it creates each doc with at least 256 bytes.

When tested with -s less than 50 it always works fine, but we run into trouble beyond that value.

BTW, the default value for -s is 10.
Comment by Thuan Nguyen [ 21/Jan/14 ]
Tested on build 2.5.0-1054; cbworkloadgen.exe still hangs with an item size of only 35 bytes:

cbworkloadgen.exe -n 10.1.2.31:8091 -r 0.9 -i 1000000 -b default -s 35 -j -t 2 -u Administrator -p password

Comment by Thuan Nguyen [ 21/Jan/14 ]
Checked the UI: it loaded only 639 items and then stopped on the default bucket.
Comment by Bin Cui [ 22/Sep/14 ]
Verified as:

c:\t1\bin>cbworkloadgen.exe -n localhost -i 1000 -s 1550 -j
  [####################] 100.1% (1053/estimated 1052 msgs)
bucket: default, msgs transferred...
       : total | last | per sec
 byte : 1690674 | 1690674 | 4594222.4
done
Comment by Bin Cui [ 22/Sep/14 ]
Verified build: 1314




[MB-12208] Security Risk: XDCR logs emit entire Document contents in error situations Created: 17/Sep/14  Updated: 22/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.2.0, 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Critical
Reporter: Gokul Krishnan Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Per recent discussions with the CFO and contract teams, we need to ensure that customers' data (document keys and values) isn't emitted in the logs. This poses a security risk, and we need default logging throttle levels that don't emit document data in readable format.

The support team has noticed this in 2.2; verifying whether this behavior still exists in 2.5.1.

Example posted in a private comment below

 Comments   
Comment by Patrick Varley [ 18/Sep/14 ]
At the same time, we need the ability to increase the log level on the fly to include this information for when we hit a wall and need that extra detail.

To summarise:

Default setting: do not expose customer data.

Ability to increase logging on the fly, which might include customer data and which the support team will explain to the end-user.
Comment by Cihan Biyikoglu [ 22/Sep/14 ]
Let's triage for 3.0.1.




[MB-6972] distribute couchbase-server through yum and ubuntu package repositories Created: 19/Oct/12  Updated: 19/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: Anil Kumar Assignee: Phil Labee
Resolution: Unresolved Votes: 3
Labels: devX
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks MB-8693 [Doc] distribute couchbase-server thr... Reopened
blocks MB-7821 yum install couchbase-server from cou... Resolved
Duplicate
duplicates MB-2299 Create signed RPM's Resolved
is duplicated by MB-9409 repository for deb packages (debian&u... Resolved
Flagged:
Release Note

 Description   
This helps us in handling dependencies that are needed for Couchbase Server; the SDK team has already implemented this for various SDK packages.

We might have to make some changes to our packaging metadata to work with this scheme.

 Comments   
Comment by Steve Yen [ 26/Nov/12 ]
to 2.0.2 per bug-scrub

first step is to do the repositories?
Comment by Steve Yen [ 26/Nov/12 ]
back to 2.0.1, per bug-scrub
Comment by Farshid Ghods (Inactive) [ 19/Dec/12 ]
Phil,
please sync up with Farshid and get the instructions that Sergey and Pavel sent
Comment by Farshid Ghods (Inactive) [ 28/Jan/13 ]
We should resolve this task once 2.0.1 is released.
Comment by Dipti Borkar [ 29/Jan/13 ]
Have we figured out the upgrade process moving forward, for example from 2.0.1 to 2.0.2, or 2.0.1 to 2.1?
Comment by Jin Lim [ 04/Feb/13 ]
Please ensure that we also confirm/validate the upgrade process moving from 2.0.1 to 2.0.2. Thanks.
Comment by Phil Labee [ 06/Feb/13 ]
Now have DEB repo working, but another issue has come up: We need to distribute the public key so that users can install the key before running apt-get.

wiki page has been updated.
Comment by kzeller [ 14/Feb/13 ]
Added to 2.0.1 RN as:

Fix:

We now provide Couchbase Server as yum and Debian package repositories.
Comment by Matt Ingenthron [ 09/Apr/13 ]
What are the public URLs for these repositories? This was mentioned in the release notes here:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html
Comment by Matt Ingenthron [ 09/Apr/13 ]
Reopening, since this isn't documented anywhere that I can find. Apologies if I'm just missing it.
Comment by Dipti Borkar [ 23/Apr/13 ]
Anil, can you work with Phil to see what are the next steps here?
Comment by Anil Kumar [ 24/Apr/13 ]
Yes I'll be having discussion with Phil and will update here with details.
Comment by Tim Ray [ 28/Apr/13 ]
Could we either remove the note about yum/deb repos in the release notes, or get those repo locations / sample files / keys added to public pages? The only links that seem like they 'might' contain the info point to internal pages I don't have access to.
Comment by Anil Kumar [ 14/May/13 ]
Thanks Tim, we have removed it from the release notes. We will add instructions about the yum/deb repo locations/files/keys to the documentation once they are available. Thanks!
Comment by kzeller [ 14/May/13 ]
Removing duplicate ticket:

http://www.couchbase.com/issues/browse/MB-7860
Comment by h0nIg [ 24/Oct/13 ]
Any update? Maybe I created a duplicate issue: http://www.couchbase.com/issues/browse/MB-9409. It also seems that the repositories listed on http://hub.internal.couchbase.com/confluence/display/CR/How+to+Use+a+Linux+Repo+--+debian are outdated.
Comment by Sriram Melkote [ 22/Apr/14 ]
I tried to install on Debian today. It failed badly. One .deb package didn't match the libc version of stable. The other didn't match the openssl version. Changing libc or openssl is simply not an option for someone using Debian stable because it messes with the base OS too deeply. So as of 4/23/14, we don't have support for Debian.
Comment by Sriram Melkote [ 22/Apr/14 ]
Anil, we have accumulated a lot of input in this bug. I don't think this will realistically go anywhere for 3.0 unless we define specific goals and a considered platform support matrix expansion. Can you please define the 3.0 goal more precisely?
Comment by Matt Ingenthron [ 22/Apr/14 ]
+1 on Siri's comments. Conversations I had with both Ubuntu (who recommend their PPAs) and Red Hat experts (who recommend setting up a repo or getting into EPEL or the like) indicated that's the best way to ensure coverage of all OSs. Binary packages built on one OS and deployed on another are risky and run into dependency issues.
Comment by Anil Kumar [ 28/Apr/14 ]
This ticket is specifically for distributing DEB and RPM packages through YUM and APT repos. We have another ticket, MB-10960, for supporting the Debian platform.
Comment by Anil Kumar [ 23/Jun/14 ]
Assigning ticket to Tony for verification.
Comment by Phil Labee [ 21/Jul/14 ]
Need to do before closing:

[ ] capture keys and process used for build that is currently posted (3.0.0-628), update tools and keys of record in build repo and wiki page
[ ] distribute 2.5.1 and 3.0.0-beta1 builds using same process, testing update capability
[ ] test update from 2.0.0 to 2.5.1 to 3.0.0
Comment by Phil Labee [ 21/Jul/14 ]
Re-opening to assign to a sprint to prepare the distribution repos for testing.
Comment by Wayne Siu [ 30/Jul/14 ]
Phil,
has build 3.0.0-973 been updated in the repos for beta testing?
Comment by Wayne Siu [ 29/Aug/14 ]
Phil,
Please refresh it with build 3.0.0-1205. Thanks.
Comment by Phil Labee [ 04/Sep/14 ]
Due to the loss of the private keys used to post 3.0.0-628, I created new key pairs. Upgrade testing was never done, so I'm starting with the 2.5.1 release version (2.5.1-1100).

upload and test using location http://packages.couchbase.com/linux-repos/TEST/:

  [X] ubuntu-12.04 x86_64
  [X] ubuntu-10.04 x86_64

  [X] centos-6-x86_64
  [X] centos-5-x86_64
Comment by Anil Kumar [ 04/Sep/14 ]
Phil / Wayne - Not sure what's happening here, please clarify.
Comment by Wayne Siu [ 16/Sep/14 ]
Please refresh with the build 3.0.0-1209.
Comment by Phil Labee [ 17/Sep/14 ]
upgrade to 3.0.0-1209 using test location:

    s3://packages.couchbase.com/linux-repos/TEST/

  [X] ubuntu-12.04 x86_64
  [X] ubuntu-10.04 x86_64

  [X] centos-6-x86_64
  [X] centos-5-x86_64

Comment by Phil Labee [ 19/Sep/14 ]
now pushing 3.0.0-1209 to production location:

    s3://packages.couchbase.com/releases/couchbase-server/

  [X] centos-6-x86_64
  [X] centos-5-x86_64

    Please verify with instructions at: http://hub.internal.couchbase.com/confluence/display/CR/How+to+Download+from+a+Linux+Repo+--+RPM

  [X] ubuntu-12.04 x86_64
  [X] ubuntu-10.04 x86_64

    Please verify with instructions at: http://hub.internal.couchbase.com/confluence/display/CR/How+to+Download+from+a+Linux+Repo+--+Ubuntu

  [ ] debian-7-x86_64





[MB-12196] [Windows] When I run cbworkloadgen.exe, I see a Warning message Created: 15/Sep/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build 1299

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install 3.0.1_1299 build
Go to bin directory on the installation directory, run cbworkloadgen.exe
You will see the following warning:
WARNING:root:could not import snappy module. Compress/uncompress function will be skipped.

Expected behavior: The above warning should not appear


 Comments   
Comment by Bin Cui [ 19/Sep/14 ]
http://review.couchbase.org/#/c/41514/




[MB-12209] [windows] failed to offline upgrade from 2.5.x to 3.0.1-1299 Created: 18/Sep/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 r2 64-bit

Attachments: Zip Archive 12.11.10.145-9182014-1010-diag.zip     Zip Archive 12.11.10.145-9182014-922-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 2.5.1 on one node
Create default bucket
Load 1000 items to bucket
Offline upgrade from 2.5.1 to 3.0.1-1299
After the upgrade, the node is reset to initial setup


 Comments   
Comment by Thuan Nguyen [ 18/Sep/14 ]
I got the same issue when doing an offline upgrade from 2.5.0 to 3.0.1-1299. Updated the title.
Comment by Thuan Nguyen [ 18/Sep/14 ]
cbcollectinfo of the node that failed to offline upgrade from 2.5.0 to 3.0.1-1299.
Comment by Bin Cui [ 18/Sep/14 ]
http://review.couchbase.org/#/c/41473/




[MB-12216] XDCR@next release - simplified end-to-end test with kvfeed, router and xmem Created: 19/Sep/14  Updated: 19/Sep/14

Status: In Progress
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: feature-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 24h
Time Spent: Not Specified
Original Estimate: 24h

Epic Link: XDCR next release




[MB-12019] XDCR@next release - Replication Manager #1: barebone Created: 19/Aug/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: techdebt-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Done Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 32h
Time Spent: Not Specified
Original Estimate: 32h

Epic Link: XDCR next release

 Description   
Build on top of the generic FeedManager with XDCR specifics:
1. interface with Distributed Metadata Service
2. interface with NS-server




[MB-10496] Investigate other possible memory allocators that provide the better fragmentation management Created: 18/Mar/14  Updated: 19/Sep/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0, 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Task Priority: Critical
Reporter: Chiyoung Seo Assignee: Dave Rigby
Resolution: Unresolved Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Issue Links:
Dependency
depends on MB-11756 segfault in ep.dylib`Checkpoint::queu... Closed
Duplicate
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-12067 Investigate explicit defragmentation ... Technical task Open Dave Rigby  

 Description   
As tcmalloc incurs significant memory fragmentation for particular load patterns (e.g., append/prepend operations), we need to investigate other options that have much less fragmentation overhead for those load patterns.

 Comments   
Comment by Matt Ingenthron [ 19/Mar/14 ]
I'm not an expert in this area any more, but I would say that my history with allocators is that there is often a tradeoff between performance aspects and space efficiency. My own personal opinion is that it may be better to not be tied to any one memory allocator, but rather have the right abstractions so we can use one or more.

I can certainly say that the initial tc_malloc integration was perhaps a bit hasty, driven by Sharon. The problem we were trying to solve at the time was actually a glibstdc++ bug on CentOS 5.2. It could have been fixed by upgrading to CentOS 5.3, but for a variety of reasons we were trying to find another workaround or solution. tc_malloc was integrated for that.

It was then that I introduced the lib/ directory and changed the compilation to set the RPATH. The reason I did this is I was trying to avoid our shipping tc_malloc, as at the time Ubuntu didn't include it since there were bugs. That gave me enough pause to think we may not want to be the first people to use tc_malloc in this particular way.

In particular, there's no reason to believe tc_malloc is best for windows. It may also not be best for platforms like mac OS and solaris/smartOS (in case we ever get there).
Comment by Matt Ingenthron [ 19/Mar/14 ]
By the way, those comments are just all history in case it's useful. Please discount or ignore it as appropriate. ;)
Comment by Chiyoung Seo [ 19/Mar/14 ]
Thanks Matt for the good comments. As you mentioned, we plan to support more than one memory allocator, so that users can choose the allocator based on their OS and workload patterns. I know that there are several open source projects; I think we can start by investigating them first, and then develop our own allocator if necessary.
Comment by Matt Ingenthron [ 19/Mar/14 ]
No worries. You'll need a benchmark or two to evaluate things. Even then, some people will probably prefer something space efficient versus time efficient, but we won't be able to support everything, etc. If it were me, I'd look to support the OS shipped advanced allocator and maybe one other, as long as they met my test criteria of course.
Comment by Dave Rigby [ 24/Mar/14 ]
Adding some possible candidates:

* jemalloc (https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) - Not used personally, but I know some of the guys at FB who use it. Reportedly has good fragmentation properties.
Comment by Chiyoung Seo [ 06/May/14 ]
Trond will build a quick prototype that is based on a slabber on top of the slabber to see if that shows better fragmentation management. He will share his ideas and initial results later.
Comment by Steve Yen [ 28/May/14 ]
Hi Trond,
Any latest thoughts / news on this?
-- steve
Comment by Aleksey Kondratenko [ 09/Jul/14 ]
There's also some recently open-sourced work by the Aerospike folks to track jemalloc allocations, apparently similar to how we're doing it with tcmalloc.
Comment by Dave Rigby [ 10/Jul/14 ]
@Alk: You got a link? I scanned back in aerospike's github fork (https://github.com/aerospike/jemalloc) for the last ~2 years but didn't see anything likely in there...
Comment by Aleksey Kondratenko [ 10/Jul/14 ]
It is a sibling project (mentioned in their server's readme): https://github.com/aerospike/asmalloc
Comment by Dave Rigby [ 21/Jul/14 ]
I've taken a look at asmalloc from Aerospike. Some notes for the record:

asmalloc isn't actually used for the "product-level" memory quota management - it's more along the lines of a debug tool which can be configured to report (via a callback interface) when allocations over certain sizes occur and/or when memory reaches certain levels.

The git repo documentation alludes to the fact that it can be LD_PRELOADed by default (in an inactive mode) and enabled on demand, but the pre-built Vagrant VM I downloaded from their website didn't have it loaded, so I suspect it is more of a developer tool than a production feature.

In terms of implementation, asmalloc just defines its own malloc/free symbols and relies on LD_PRELOAD / dlsym with RTLD_NEXT to interpose its symbols in front of the real malloc library. I'd note however that this isn't directly supported on Windows (which isn't a problem for Aerospike as they only support Linux).


Their actual tracking of memory for "namespaces" (aka buckets) is done by simple manual counting - incrementing and decrementing atomic counters when documents are added / removed / resized.
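For context, a minimal sketch of that interposition technique (Linux-only, illustrative only - this is not the asmalloc code itself): define malloc/free in a shared library, resolve the real symbols with dlsym(RTLD_NEXT, ...), and load the library via LD_PRELOAD, e.g. g++ -shared -fPIC -o libinterpose.so interpose.cpp -ldl, then LD_PRELOAD=./libinterpose.so ./some_program.

#include <dlfcn.h>
#include <atomic>
#include <cstddef>

static std::atomic<std::size_t> interposed_allocs{0};

extern "C" void* malloc(std::size_t size) {
    // Resolve the next ("real") malloc in link order on first use.
    static void* (*real_malloc)(std::size_t) =
        reinterpret_cast<void* (*)(std::size_t)>(dlsym(RTLD_NEXT, "malloc"));
    // Stand-in for whatever accounting the interposer wants to do.
    interposed_allocs.fetch_add(1, std::memory_order_relaxed);
    return real_malloc(size);
}

extern "C" void free(void* ptr) {
    static void (*real_free)(void*) =
        reinterpret_cast<void (*)(void*)>(dlsym(RTLD_NEXT, "free"));
    real_free(ptr);
}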
Comment by Dave Rigby [ 07/Aug/14 ]
Update on progress:

I've constructed a pathologically bad workload for the memory allocator (PathoGen) and run this on TCMalloc, JEMalloc and TCMalloc with aggressive decommit. Details at: https://docs.google.com/document/d/1sgE9LFfT5ZD4FbSZqCuUtzOoLu5BFi1Kf63R9kYuAyY/edit#

TL;DR:

* I can demonstrate TCMalloc having significantly higher RSS than the actual dataset, *and* holding onto this memory after the workload has decreased. JEMalloc doesn't have these problems.
Interestingly, enabling TCMALLOC_AGGRESSIVE_DECOMMIT makes TCMalloc behave very close to Jemalloc (i.e. RSS tracks workload, *and* there is minimal general overhead).

Further investigation is needed to see the implications of "aggressive decommit", particularly any negative performance implications, or if there is a middle-ground setting.
Comment by Aleksey Kondratenko [ 07/Aug/14 ]
Interesting. But if I understand correctly, this is actually not the worst possible workload. If you increase item sizes and overwrite all documents, then the smaller size classes are more or less completely freed (logically; internally at least tcmalloc will hold some in thread caches). I think you can make it worse by leaving a significant count (say 1-2%) per size class allocated (i.e. by not overwriting some docs). This is where I expect both jemalloc and tcmalloc to be worse than plain glibc malloc (which is worth testing too; but be sure to check whether it has had any major revisions - I think it has - so RHEL5 vs. RHEL6 vs. RHEL7 might give you different behavior).
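To illustrate the size-class pinning pattern described above, a toy sketch (not the actual PathoGen workload; the sizes and counts are arbitrary): fill a size class, free most of it but keep a small percentage of the allocations alive, then move on to a larger size class. The sparse survivors keep whole spans/pages resident even though each class is logically almost empty.

#include <cstdlib>
#include <vector>

int main() {
    std::vector<void*> pinned;                 // ~1.5% of each size class kept alive
    for (std::size_t sz = 64; sz <= 8192; sz *= 2) {
        std::vector<void*> batch;
        for (int i = 0; i < 50000; ++i) {
            batch.push_back(std::malloc(sz));
        }
        for (std::size_t i = 0; i < batch.size(); ++i) {
            if (i % 64 == 0) {
                pinned.push_back(batch[i]);    // survivor pins its span/page
            } else {
                std::free(batch[i]);
            }
        }
    }
    // RSS at this point reflects the pages held by the sparse survivors,
    // not the (small) amount of logically live data.
    for (void* p : pinned) {
        std::free(p);
    }
    return 0;
}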
Comment by Dave Rigby [ 07/Aug/14 ]
@Alk: Agreed - and in fact I was planning on adding a separate workload which does essentially that. I say a separate benchmark because there are a few, arguably orthogonal, "troublesome" aspects for allocators.

The PathoGen workload previously mentioned was trying to investigate the RSS overhead that occurs when allocators "hang on" to memory after the application has finished with it, and also to look at the overhead associated with this. My intent was to model what customers V and F have seen, where the OOM-killer takes out memcached while large amounts of memory are sitting in the various TCMalloc free lists.

For the size-class problem which you describe, I believe ultimately this /will/ require some form of active defragmentation inside memcached/ep_engine - regardless of the allocation policy at object-creation time, you cannot predict what objects will/won't still be in memory later on.

I hope to hack together a "pyramid scheme" workload like you describe tomorrow, and we can see how the various allocators behave.

Comment by Dave Rigby [ 07/Aug/14 ]
@Alk: Additionally, I have some other numbers with revAB_sim (heavy-append workload, building a reverse address book) here: https://docs.google.com/spreadsheets/d/1JhsmpvRXGS9hmks2sY-Pz8obCllpOHPoVBsQ6J4ZuDg/edit#gid=0

They are less conclusive, but do show that TCMalloc gains a speedup (compared to jemalloc), at the cost of RSS size, particularly when looking at jemalloc [narenas:1] which is probably the more sensible configuration given our alloc/free of documents from different threads.

Arguably the most interesting thing is the terrible performance of glibc (Ubuntu 14.04) in terms of RSS usage...
Comment by Dave Rigby [ 12/Aug/14 ]
Toy build with "aggressive decommit" enabled: http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent64-3.0.0-toy-daver-x86_64_3.0.0-700-toy.rpm
Comment by Dave Rigby [ 15/Aug/14 ]
Note that the aforementioned toy build suffered from the TCMalloc -DSMALL_BUT_SLOW bug (MB-11961), so the results are somewhat moot. Having said that, we can do exactly the same test without a modified build by simply setting the env var TCMALLOC_AGGRESSIVE_DECOMMIT=t.
Comment by Dave Rigby [ 15/Aug/14 ]
I've conducted a further set of tests using PathoGen, specifically I've expanded it to address some of the feedback from people on this thread by adding a "frozen" mode where a small percentage of documents are frozen at a given size after each iteration. Results are here: https://docs.google.com/document/d/1sgE9LFfT5ZD4FbSZqCuUtzOoLu5BFi1Kf63R9kYuAyY/edit#heading=h.55etcgxxrj3a

Most interesting is probably the graph on page seven (direct link: https://docs.google.com/spreadsheets/d/1JhsmpvRXGS9hmks2sY-Pz8obCllpOHPoVBsQ6J4ZuDg/edit#gid=319463280)

I won't repeat the full analysis from the doc, but suffice it to say that either TCMalloc with "aggressive decommit" or jemalloc shows significantly reduced memory overhead compared to the current default we use.
Comment by Aleksey Kondratenko [ 15/Aug/14 ]
See also http://www.couchbase.com/issues/browse/MB-11974
Comment by Dave Rigby [ 19/Sep/14 ]
memcached:

http://review.couchbase.org/41485 - jemalloc: Implement release_free_memory
http://review.couchbase.org/41486 - jemalloc: Report free {un,}mapped size
http://review.couchbase.org/41487 - Add 'enable_thread_cache' call to hooks API
http://review.couchbase.org/41488 - Add alloc_hooks API to mock server
http://review.couchbase.org/41489 - Add get_mapped_bytes() and release_free_memory() to testHarness
http://review.couchbase.org/41490 - jemalloc: Implement mem_used tracking using experimental hooks
http://review.couchbase.org/41491 - Add run_defragmenter_task to ENGINE API.

ep_engine:

http://review.couchbase.org/41494 - MB-10496 [1/6]: Initial version of HashTable defragmenter
http://review.couchbase.org/41495 - MB-10496 [2/6]: Implement run_defragmenter_task for ep_engine
http://review.couchbase.org/41496 - MB-10496 [3/6]: Unit test for degragmenter task
http://review.couchbase.org/41497 - MB-10496 [4/6]: Add epoch field to Blob; use as part of defragmenter policy
http://review.couchbase.org/41498 - MB-10496 [5/6]: pause/resume visitor support for epStore & HashTable
http://review.couchbase.org/41499 - MB-10496 [6/6]: Use pause/resume visitor for defragmenter task
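For context on what a release_free_memory-style hook amounts to on the TCMalloc side, a minimal sketch using the public gperftools MallocExtension API; the wrapper names are illustrative, not the actual memcached alloc_hooks code:

#include <cstddef>
#include <gperftools/malloc_extension.h>

// Hand memory sitting in the allocator's free lists back to the OS.
extern "C" void example_release_free_memory() {
    MallocExtension::instance()->ReleaseFreeMemory();
}

// Report how many bytes the page heap holds without having returned them to the OS.
extern "C" std::size_t example_pageheap_free_bytes() {
    std::size_t free_bytes = 0;
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.pageheap_free_bytes", &free_bytes);
    return free_bytes;
}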
Comment by Cihan Biyikoglu [ 19/Sep/14 ]
Discussed this with David H, David R and Chiyoung. Let's shoot for Sherlock.




[MB-12214] Move Sqoop provider to DCP Created: 19/Sep/14  Updated: 19/Sep/14

Status: Open
Project: Couchbase Server
Component/s: DCP
Affects Version/s: 3.0
Fix Version/s: sherlock
Security Level: Public

Type: Improvement Priority: Major
Reporter: Cihan Biyikoglu Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: tracking item for moving sqoop over to DCP.





[MB-12215] Monitor open files or file descriptors via REST and create an alert at a certain threshold Created: 19/Sep/14  Updated: 19/Sep/14

Status: Open
Project: Couchbase Server
Component/s: RESTful-APIs
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Major
Reporter: Larry Liu Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
Can we have a feature to monitor open files or file descriptors via REST and create an alert at a certain threshold?




[MB-12138] {Windows - DCP}:: View Query fails with error 500 reason: error {"error":"error","reason":"{index_builder_exit,89,<<>>}"} Created: 05/Sep/14  Updated: 19/Sep/14  Resolved: 19/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Nimish Gupta
Resolution: Fixed Votes: 0
Labels: windows, windows-3.0-beta
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1267, Windows 2012, 64 x, machine:: 172.23.105.112

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-12138/172.23.105.112-952014-1511-diag.zip
Is this a Regression?: Yes

 Description   


1. Create 1 Node cluster
2. Create default bucket and add 100k items
3. Create views and query it

Seeing the following exceptions

http://172.23.105.112:8092/default/_design/ddoc1/_view/default_view0?connectionTimeout=60000&full_set=true&limit=100000&stale=false error 500 reason: error {"error":"error","reason":"{index_builder_exit,89,<<>>}"}

We cannot run any view tests as a result


 Comments   
Comment by Anil Kumar [ 16/Sep/14 ]
Nimish/Siri - Any update on this.
Comment by Meenakshi Goel [ 17/Sep/14 ]
Seeing similar issue in Views DGM test http://qa.hq.northscale.net/job/win_2008_x64--69_06_view_dgm_tests-P1/1/console
Test : view.createdeleteview.CreateDeleteViewTests.test_view_ops,ddoc_ops=update,test_with_view=True,num_ddocs=4,num_views_per_ddoc=10,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction
Comment by Nimish Gupta [ 17/Sep/14 ]
We have found the root cause and working on the fix.
Comment by Nimish Gupta [ 19/Sep/14 ]
http://review.couchbase.org/#/c/41480




[MB-10662] _all_docs is no longer supported in 3.0 Created: 27/Mar/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10649 _all_docs view queries fails with err... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
As of 3.0, view engine will no longer support the special predefined view, _all_docs.

It was not a published feature, but as it has been around for a long time, it is possible it was actually utilized in some setups.

We should document that _all_docs queries will not work in 3.0

 Comments   
Comment by Cihan Biyikoglu [ 27/Mar/14 ]
Thanks. are there internal tools depending on this? Do you know if we have deprecated this in the past? I realize it isn't a supported API but want to make sure we keep the door open for feedback during beta from large customers etc.
Comment by Perry Krug [ 28/Mar/14 ]
We have a few (very few) customers who have used this. They've known it is unsupported...but that doesn't ever really stop anyone if it works for them.

Do we have a doc describing what the proposed replacement will look like and will that be available for 3.0?
Comment by Ruth Harris [ 01/May/14 ]
_all_docs is not mentioned anywhere in the 2.2+ documentation. Not sure how to handle this. It's not deprecated because it was never intended for use.
Comment by Perry Krug [ 01/May/14 ]
I think at the very least a prominent release note is appropriate.
Comment by Gerald Sangudi [ 17/Sep/14 ]
For N1QL, please advise customers to do

CREATE PRIMARY INDEX on --bucket-name--.




[MB-9656] XDCR destination endpoints for "getting xdcr stats via rest" in url encoding Created: 29/Nov/13  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 3.0, 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Blocker
Reporter: Patrick Varley Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: customer, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://docs.couchbase.com/couchbase-manual-2.2/#getting-xdcr-stats-via-rest


 Description   
In our documentation the destination endpoints are not URL-encoded, where "/" becomes "%2F". This has misled customers. That section should be in the following format:

replications%2F[UUID]%2F[source_bucket]%2F[destination_bucket]%2Fdocs_written

If this change is made we should remove this line too:

You need to provide properly URL-encoded /[UUID]/[source_bucket]/[destination_bucket]/[stat_name]. To get the number of documents written:



 Comments   
Comment by Amy Kurtzman [ 16/May/14 ]
The syntax and example code in this whole REST section needs to be cleaned up and tested. It is a bigger job than just fixing this one.
Comment by Patrick Varley [ 17/Sep/14 ]
I fell down this hole again, and so did another Support Engineer. We really need to get this fixed in all versions.

The 3.0 documentation has this problem too.
Comment by Ruth Harris [ 17/Sep/14 ]
Why are you suggesting that the slash in the syntax be %2F???
This is not a blocker.
Comment by Patrick Varley [ 18/Sep/14 ]
I believe this is a blocker, as it has consumed a large amount of support's time on 3 separate occasions now. That also means 3 separate end-users have had issues with this documentation.

Because if you use plain slashes it does not work. Look at the examples further down the page.

This url works:
curl -u admin:password http://localhost:8091/pools/default/buckets/default/stats/replications%2F8ba6870d88cd72b3f1db113fc8aee675%2Fsource_bucket%2Fdestination_bucket%2Fdocs_written

This url does not:
 curl -u admin:password http://localhost:8091/pools/default/buckets/default/stats/replications/8ba6870d88cd72b3f1db113fc8aee675/source_bucket/destination_bucket/docs_written

That is pretty hard to work out from our documentation.




[MB-10961] Reverse iteration Created: 24/Apr/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Jens Alfke Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would make life easier for (some) clients if ForestDB supported a reverse iteration option. This would simply return the same documents but in reverse order.

Both Couchbase Lite and Couchbase Server have APIs that require the ability to reverse the order of query results. I'm currently implementing this for Couchbase Lite.
* It complicates my logic since I have to add a layer of indirection over the ForestDB iterator;
* It's less memory efficient because I have to buffer up all the results in memory (see the sketch after this list);
* It's less parallelizable because I can't return any results to the caller until the entire iteration is complete.
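A minimal sketch of that client-side workaround, assuming a generic forward cursor; the Row type and next() method are hypothetical stand-ins, not the ForestDB iterator API. Every row has to be buffered before the first one can be handed back in reverse.

#include <optional>
#include <string>
#include <vector>

struct Row {
    std::string key;
    std::string value;
};

// Cursor is any type exposing std::optional<Row> next() that yields rows in
// forward (ascending) order and std::nullopt at the end.
template <typename Cursor>
std::vector<Row> reversedResults(Cursor& cursor) {
    std::vector<Row> buffered;
    while (auto row = cursor.next()) {   // full forward pass: O(n) extra memory
        buffered.push_back(std::move(*row));
    }
    // Nothing can be returned to the caller until the pass above completes.
    return {buffered.rbegin(), buffered.rend()};
}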

 Comments   
Comment by Sundar Sridharan [ 18/Sep/14 ]
Fix uploaded for review: http://review.couchbase.org/41476. Thanks.




[MB-12213] Get the couchbase-server_src.tar.gz for 3.0.0 Created: 18/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Wayne Siu Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-12176] Missing port number on the network ports documentation for 3.0 Created: 12/Sep/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Ruth Harris [ 16/Sep/14 ]
Network Ports section of the Couchbase Server 3.0 beta doc has been updated with the new ssl port, 11207, and the table with the details for all of the ports has been updated.

http://docs.couchbase.com/prebuilt/couchbase-manual-3.0/Install/install-networkPorts.html
The site (and network ports section) should be refreshed soon.

thanks, Ruth




[MB-12193] Docs should explicitly state that we don't support online downgrades in the installation guide Created: 15/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Gokul Krishnan Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
In the installation guide, we should call out the fact that online downgrades (from 3.0 to 2.5.1) aren't supported and that downgrades will require servers to be taken offline.

 Comments   
Comment by Ruth Harris [ 15/Sep/14 ]
In the 3.0 documentation:

Upgrading >
<note type="important">Online downgrades from 3.0 to 2.5.1 are not supported. Downgrades require that servers be taken offline.</note>

Should this be in the release notes too?
Comment by Matt Ingenthron [ 15/Sep/14 ]
"online" or "any"?
Comment by Ruth Harris [ 18/Sep/14 ]
Talked to Raju (QE) and all online downgrades are not supported. This is not a behavior change and is not appropriate for the core documentation. Removed the note from the upgrading section. Please advise whether this should be explicitly stated for all downgrades.

--Ruth




[MB-12212] Update AMI for Sync Gateway to 1.0.2 Created: 18/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.2.0
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Jessica Liu Assignee: Wei-Li Liu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Sync Gateway was updated to 1.0.2. The AMIs related to Sync Gateway need to be updated including:

Sync Gateway Enterprise, only: https://aws.amazon.com/marketplace/pp/B00M28SG0E/ref=sp_mpg_product_title?ie=UTF8&sr=0-2

Couchbase Server + Sync Gateway Community: https://aws.amazon.com/marketplace/pp/B00FA8DO50/ref=sp_mpg_product_title?ie=UTF8&sr=0-5




[MB-12199] curl -H arguments need to use double quotes Created: 16/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Matt Ingenthron Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Current documentation states:

Indicates that an HTTP PUT operation is requested.
-H 'Content-Type: application/json'

And that will fail, seemingly owing to the single quotes. See also:
https://twitter.com/RamSharp/status/511739806528077824


 Comments   
Comment by Ruth Harris [ 16/Sep/14 ]
TASK for TECHNICAL WRITER
Fix in 3.0 == FIXED: Added single quotes or removed quotes from around the http string in appropriate examples.
Design Doc rest file - added single quotes, Compaction rest file ok, Trbl design doc file ok

FIX in 2.5: TBD

-----------------------

CONCLUSION:
At least with PUT, both single and double quotes work around: Content-Type: application/json. Didn't check GET or DELETE.
With PUT and DELETE, no quotes and single quotes around the http string work. Note: Some of the examples are missing a single quote around the http string. Meaning, one quote is present, but either the ending or beginning quote is missing. Didn't check GET.

Perhaps a missing single quote around the http string was the problem?
Perhaps there were formatting tags associated with ZlatRam's byauth.ddoc code that were causing the problem?

----------------------

TEST ONE:
1. create a ddoc and view from the UI = testview and testddoc
2. retrieve the ddoc using GET
3. use single quotes around Content-Type: application/json and around the http string. Note: Some of the examples are missing single quotes around the http string.
code: curl -X GET -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_testddoc'
results: {
    "views": {
        "testview": {
            "map": "function (doc, meta) {\n emit(meta.id, null);\n}"
        }
    }
}

TEST TWO:
1. delete testddoc
2. use single quotes around Content-Type: application/json and around the http string
code: curl -X DELETE -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_testddoc'
results: {"ok":true,"id":"_design/dev_testddoc"}
visual check via UI: Yep, it's gone


TEST THREE:
1. create a myauth.ddoc text file using the code in the Couchbase design doc documentation page.
2. Use PUT to create a dev_myauth design doc
3. use single quotes around Content-Type: application/json and around the http string. Note: I used "| python -m json.tool" to get pretty print output

myauth.ddoc contents: {"views":{"byloc":{"map":"function (doc, meta) {\n if (meta.type == \"json\") {\n emit(doc.city, doc.sales);\n } else {\n emit([\"blob\"]);\n }\n}"}}}
code: curl -X PUT -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_myauth' -d @myauth.ddoc | python -m json.tool
results: {
    "id": "_design/dev_myauth",
    "ok": true
}
visual check via UI: Yep, it's there.

TEST FOUR:
1. copy myauth.ddoc to zlat.ddoc
2. Use PUT to create a dev_zlat design doc
3. use double quotes around Content-Type: application/json and single quotes around the http string.

zlat.ddoc contents: {"views":{"byloc":{"map":"function (doc, meta) {\n if (meta.type == \"json\") {\n emit(doc.city, doc.sales);\n } else {\n emit([\"blob\"]);\n }\n}"}}}
code: curl -X PUT -H "Content-Type: application/json" 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlat' -d @zlat.ddoc | python -m json.tool
results: {
    "id": "_design/dev_zlat",
    "ok": true
}
visual check via UI: Yep, it's there.


TEST FIVE:
1. create a ddoc text file using ZlatRam's ddoc code
2. flattened the formatting so it reflected the code in the Couchbase example (used above)
3. Use PUT and single quotes.

zlatram contents: {"views":{"byauth":{"map":"function (doc, username) {\n if (doc.type == \"session\" && doc.user == username && Date.Parse(doc.expires) > Date.Parse(Date.Now()) ) {\n emit(doc.token, null);\n }\n}"}}}
code: curl -X PUT -H 'Content-Type: application/json' 'http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlatram' -d @zlatram.ddoc | python -m json.tool
results: {
    "id": "_design/dev_zlatram",
    "ok": true
}
visual check via UI: Yep, it's there.

TEST SIX:
1. delete zlatram ddoc but without quotes around the http string: curl -X DELETE -H 'Content-Type: application/json' http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlatram
2. results: {
    "id": "_design/dev_zlatram",
    "ok": true
}
3. verify via UI: Yep, it gone
4. add zlatram but without quotes around the http string: curl -X PUT -H 'Content-Type: application/json' http://Administrator:password@10.5.2.54:8092/test/_design/dev_zlatram
5. results: {
    "id": "_design/dev_zlatram",
    "ok": true
}
6. verify via UI: Yep, it back.




[MB-12090] add stale=false semantic changes to dev guide Created: 28/Aug/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Matt Ingenthron Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged
Is this a Regression?: No

 Description   
Need to change the dev guide to explain the semantics change with the stale parameter.

 Comments   
Comment by Matt Ingenthron [ 28/Aug/14 ]
I could not find the 3.0 dev guide to write up something, so I've generated a diff based on the 2.5 dev guide. Note that much of that dev guide refers to the 3.0 admin guide section on views. I could not find that in the "dita" directory, so I could not contribute a change to the XML. I think this, together with what I put in MB-12052, should help.


diff --git a/content/couchbase-devguide-2.5/finding-data-with-views.markdown b/content/couchbase-devguide-2.5/finding-data-with-views.markdown
index 77735b9..811dff0 100644
--- a/content/couchbase-devguide-2.5/finding-data-with-views.markdown
+++ b/content/couchbase-devguide-2.5/finding-data-with-views.markdown
@@ -1,6 +1,6 @@
 # Finding Data with Views
 
-In Couchbase 2.1.0 you can index and query JSON documents using *views*. Views
+In Couchbase you can index and query JSON documents using *views*. Views
 are functions written in JavaScript that can serve several purposes in your
 application. You can use them to:
 
@@ -323,16 +323,25 @@ Forinformation about the sort order of indexes, see the
 [Couchbase Server Manual](http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/).
 
 The real-time nature of Couchbase Server means that an index can become outdated
-fairly quickly when new entries and updates occur. Couchbase Server generates
-the index when it is queried, but in the meantime more data can be added to the
-server and this information will not yet be part of the index. To resolve this,
-Couchbase SDKs and the REST API provide a `stale` parameter you use when you
-query a view. With this parameter you can indicate you will accept the most
-current index as it is, you want to trigger a refresh of the index and retrieve
-these results, or you want to retrieve the existing index as is but also trigger
-a refresh of the index. For instance, to query a view with the stale parameter
-using the Ruby SDK:
+fairly quickly when new entries and updates occur. Couchbase Server updates
+the index at the time the query is received if you supply the argument
+`false` to the `stale` parameter.
+
+<div class="notebox">
+<p>Note</p>
+<p>Starting with the 3.0 release, the "stale" view query argument
+"false" has been enhanced so it will consider all document changes
+which have been received at the time the query has been received. This
+means that use of the `durability requirements` or `observe` feature
+to block for persistence in application code before issuing the
+`false` stale query is no longer needed. It is recommended that you
+remove all such application level checks after completing the upgrade
+to the 3.0 release.
+</p>
+</div>
 
+For instance, to query a view with the stale parameter
+using the Ruby SDK:
 
 ```
 doc.recent_posts(:body => {:stale => :ok})
@@ -905,13 +914,14 @@ for(ViewRow row : result) {
 }
 ```
 
-Before we create a Couchbase client instance and connect to the server, we set a
-system property 'viewmode' to 'development' to put the view into production
-mode. Then we query our view and limit the number of documents returned to 20
-items. Finally when we query our view we set the `stale` parameter to FALSE to
-indicate we want to reindex and include any new or updated beers in Couchbase.
-For more information about the `stale` parameter and index updates, see Index
-Updates and the Stale Parameter in the
+Before we create a Couchbase client instance and connect to the
+server, we set a system property 'viewmode' to 'development' to put
+the view into production mode. Then we query our view and limit the
+number of documents returned to 20 items. Finally when we query our
+view we set the `stale` parameter to FALSE to indicate we want to
+consider any recent changes to documents. For more information about
+the `stale` parameter and index updates, see Index Updates and the
+Stale Parameter in the
 [Couchbase Server Manual](http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#couchbase-views-writing-stale).
 
 The last part of this code sample is a loop we use to iterate through each item
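
For reference, the stale=false behaviour described in the note above can also be exercised directly against the view REST API on port 8092. A minimal Python 2 sketch follows; the bucket, design document and view names (beer-sample, beer, brewery_beers) are illustrative and assume the sample bucket's bundled design document:

```
import json
import urllib2

# Query a view with stale=false so that document changes received before
# the query are reflected in the index (the 3.0 semantics from the note
# above), limiting the result to 20 rows.
url = ("http://localhost:8092/beer-sample/_design/beer"
       "/_view/brewery_beers?stale=false&limit=20")
rows = json.load(urllib2.urlopen(url))["rows"]
for row in rows:
    print("%s -> %s" % (row["id"], row["key"]))
```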




[MB-12158] erlang gets stuck in gen_tcp:send despite socket being closed (was: Replication queue grows unbounded after graceful failover) Created: 09/Sep/14  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dcp_proxy.beam    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
After speaking with Mike briefly, sounds like this may be a known issue. My apologies if there is a duplicate issue already filed.

Logs are here:
 https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-176-128-88.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-193-231-33.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-111-249.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-84-241.us-west-1.compute.amazonaws.com.zip

 Comments   
Comment by Mike Wiederhold [ 10/Sep/14 ]
Perry,

The stats seem to be missing for dcp streams so I cannot look further into this. If you can still reproduce this on 3.0 build 1209 then assign it back to me and include the logs.
Comment by Perry Krug [ 11/Sep/14 ]
Mike, does the cbcollect_info include these stats or do you need me to gather something specifically when the problem occurs?

If not, let's also get them included for future builds...
Comment by Perry Krug [ 11/Sep/14 ]
Hey Mike, I'm having a hard time reproducing this on build 1209 where it seemed rather easy on previous builds. Do you think any of the changes from the "bad_replicas" bug would have affected this? Is it worth reproducing on a previous build where it was easier in order to get the right logs/stats or do you think it may be fixed already?
Comment by Mike Wiederhold [ 11/Sep/14 ]
This very well could be related to MB-12137. I'll take a look at the cluster and if I don't find anything worth investigating further then I think we should close this as cannot reproduce since it doesn't seem to happen anymore on build 1209. If there is still a problem I'm sure it will be reproduced again later in one of our performance tests.
Comment by Mike Wiederhold [ 11/Sep/14 ]
It looks like one of the dcp connections to the failed over node was still active. My guess is that the node went down and came back up quickly. As a result it's possible that ns_server re-established the connection with the downed node. Can you attach the logs and assign this to Alk so he can take a look?
Comment by Perry Krug [ 11/Sep/14 ]
Thanks Mike.

Alk, logs are attached from the first time this was reproduced. Let me know if you need me to do so again.

Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Mike, btw for the future, if you could post exact details (i.e. node and name of connection) of stuff you want me to double-check/explain it could have saved me time.

Also, let me note that it's replica and node master who establishes replication. I.e. we're "pulling" rather than "pushing" replication.

I'll look at all this and see if I can find something.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Sorry, I meant that it's the replica, not the master, that initiates the replication.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Indeed I'm seeing dcp connection from memcached on .33 to beam of .88. And it appears that something in dcp replicator is stuck. I'll need a bit more time to figure this out.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Looks like socket send gets blocked somehow despite socket actually being closed already.

Might be serious enough to be a show stopper for 3.0.

Do you by any chance still have nodes running? Or if not, can you easily reproduce this? Having direct access to bad node might be very handy to diagnose this further.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Moved back to 3.0. Because if it's indeed erlang bug it might be very hard to fix and because it may happen not just during failover.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage - need an update pls.
Comment by Perry Krug [ 12/Sep/14 ]
I'm reproducing now and will post both the logs and the live systems momentarily
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Able to reproduce this condition with erlang outside of our product (which is great news):

* connect gen_tcp socket to nc or irb process listening

* spawn erlang process that will send stuff infinitely on that socket and will eventually block

* from erlang console do gen_tcp:close (i.e. while other erlang process is blocked writing)

* observe how erlang process that's blocked is still blocked

* observe with lsof that socket isn't really closed

* close the socket on the other end (by killing nc)

* observe with lsof that socket is closed

* observe how erlang process is still blocked (!) despite underlying socket fully dead

The fact that it's not a race is really great because dealing with deterministic bug (even if it's "feature" from erlang's point of view) is much easier
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Fix is at: http://review.couchbase.org/41396

I need approval to get this in 3.0.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Attaching fixed dcp_proxy.beam if somebody wants to be able to test the fix without waiting for build
Comment by Perry Krug [ 12/Sep/14 ]
Awesome as usual Alk, thanks very much.

I'll give this a try on my side for verification.
Comment by Parag Agarwal [ 12/Sep/14 ]
Alk, will this issue occur with TAP as well, e.g. during upgrades?
Comment by Mike Wiederhold [ 12/Sep/14 ]
Alk,

I apologize for not including a better description of what happened. In the future I'll make sure to leave better details before assigning bugs to others so that we don't have multiple people duplicating the same work.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> Alk, will this issue occur in TAP as well? during upgrades.

No.
Comment by Perry Krug [ 12/Sep/14 ]
As of yet unable to reproduce this on build 1209+dcp_proxy.beam.

Thanks for the quick turnaround Alk.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage discussion:
under load this may happen frequently -
there is good chance that this recovers itself in few mins - it should but we should validate.
if we are in this state, we can restart erlang to get out of the situation - no app unavailability required
fix could be risky to take at this point

decision: not taking this for 3.0
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Mike, I need your ACK on this:

Because of dcp nops between replicators, the dcp producer should, after a few minutes, close its side of the socket and release all resources.

Am I right? I said this in the meeting just a few minutes ago and it affected the decision. If I'm wrong (say, if you decided to disable nops in the end, or if you know it's broken, etc.), then we need to know.
Comment by Perry Krug [ 12/Sep/14 ]
FWIW, I have seen that this does not recover after a few minutes. However, I agree that it can be worked around either by restarting beam or by bringing the node back into the cluster. Unless we think this will happen much more often, I agree it could be deferred out of 3.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Well, if it does not recover then it can be argued that we have another bug on the ep-engine side that may lead to similar badness (queue size and resources eaten) _without_ a clean workaround.

Mike, we'll need your input on DCP NOPs.
Comment by Mike Wiederhold [ 12/Sep/14 ]
I was curious about this myself. As far as I know the noop code is working properly and we have some tests to make sure it is. I can work with Perry to try to figure out what is going on on the ep-engine side and see if the noops are actually being sent. I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.

I can rule this out. We do have a connection between the destination's beam and the source's memcached. And we _don't_ have the beam's connection to the destination's memcached anymore. Erlang is stuck writing to a dead socket. So there's no way you could get nop acks back.
Comment by Perry Krug [ 15/Sep/14 ]
I've confirmed that this state persists for much longer than a few minutes...I've not ever seen it recover itself, and have left it to run for 15-20 minutes at least.

Do you need a live system to diagnose?
Comment by Cihan Biyikoglu [ 15/Sep/14 ]
thanks for the update - Mike, sounds like we should open an issue for DCP to reliably detect these conditions. We should add this in for 3.0.1.
Perry, could you confirm that restarting the erlang process resolves the issue?
thanks
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41410

Mike will open different ticket for NOPs in DCP.




[MB-11998] Working set is screwed up during rebalance with delta recovery (>95% cache miss rate) Created: 18/Aug/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Venu Uppalapati
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1169

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File cache_miss_rate.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares-dev/45/artifact/
Is this a Regression?: No

 Description   
1 of 4 nodes is being re-added after failover.
500M x 2KB items, 10K mixed ops/sec.

Steps:
1. Failover one of nodes.
2. Add it back.
3. Enabled delta recovery.
4. Sleep 20 minutes.
5. Rebalance cluster.

 Comments   
Comment by Abhinav Dangeti [ 17/Sep/14 ]
Warming up during the delta recovery without an access log seems to be the cause for this.
Comment by Abhinav Dangeti [ 18/Sep/14 ]
Venu, my suspicion here is that there was no access log generated during the course of this test. Can you set the access log task time to zero, and its sleep interval to say 5-10 minutes and retest this scenario? I think you will need to be using the performance framework to be able to plot the cache miss ratio.




[MB-12210] xdcr related services sometimes log debug and error messages to non-xdcr logs (was: XDCR Error Logging Improvement) Created: 18/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Chris Malarky Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: logging
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
When debugging an XDCR issue some very useful information was in the ns_server.error.log but not the ns_server.xdcr_errors.log

ns_server.xdcr_errors.log:

[xdcr:error,2014-09-18T7:02:12.674,ns_1@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:<0.8020.1657>:xdc_vbucket_rep:init_replication_state:496]Error in fetching remot bucket, error: timeout,sleep for 30 secs before retry.
[xdcr:error,2014-09-18T7:02:12.674,ns_1@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:<0.8021.1657>:xdc_vbucket_rep:init_replication_state:503]Error in fetching remot bucket, error: all_nodes_failed, msg: <<"Failed to grab remote bucket `wi_backup_bucket_` from any of known nodes">>sleep for 30 secs before retry

ns_server.error.log:

[ns_server:error,2014-09-18T7:02:12.674,ns_1@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:<0.8022.1657>:remote_clusters_info: do_mk_json_get:1460]Request to http://Administrator:****@10.x.x.x:8091/pools failed:
{error,rest_error,
       <<"Error connect_timeout happened during REST call get to http://10.x.x.x:8091/pools.">>,
       {error,connect_timeout}}
[ns_server:error,2014-09-18T7:02:12.674,ns_1@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:remote_clusters_info<0.20250.6>: remote_clusters_info:handle_info:435]Failed to grab remote bucket `wi_backup_bucket_`: {error,rest_error,
                                                   <<"Error connect_timeout happened during REST call get to http://10.x.x.x:8091/pools.">>,
                                                   {error,connect_timeout}}

Is there any way these messages could appear in the xdcr_errors.log?

 Comments   
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
Yes, valid request. Some of that, but not all, has been addressed in 3.0.
Comment by Aleksey Kondratenko [ 18/Sep/14 ]
Good candidate for 3.0.1 but not necessarily important enough. I.e. in light of ongoing rewrite.




[MB-12185] update to "couchbase" from "membase" in gerrit mirroring and manifests Created: 14/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0, 2.5.1, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-8297 Some key projects are still hosted at... Open

 Description   
One of the key components of Couchbase is still only at github.com/membase and not at github.com/couchbase. I think it's okay to mirror to both locations (not that there's an advantage), but for sure it should be at couchbase and the manifest for Couchbase Server releases should be pointing to Couchbase.

I believe the steps here are as follows:
- Set up a github.com/couchbase/memcached project (I've done that)
- Update gerrit's commit hook to update that repository
- Change the manifests to start using that repository

Assigning this to build as a component, as gerrit is handled by the build team. Then I'm guessing it'll need to be handed over to Trond or another developer to do the manifest change once gerrit is up to date.

Since memcached is slow changing now, perhaps the third item can be done earlier.

 Comments   
Comment by Chris Hillery [ 15/Sep/14 ]
Actually manifests are owned by build team too so I will do both parts.

However, the manifest for the hopefully-final release candidate already exists, and I'm a teensy bit wary about changing it after the fact. The manifest change may need to wait for 3.0.1.
Comment by Matt Ingenthron [ 15/Sep/14 ]
I'll leave it to you to work out how to fix it, but I'd just point out that manifest files are mutable.
Comment by Chris Hillery [ 15/Sep/14 ]
The manifest we build from is mutable. The historical manifests recording what we have already built really shouldn't be.
Comment by Matt Ingenthron [ 15/Sep/14 ]
True, but they are. :) That was half me calling back to our discussion about tagging and mutability of things in the Mountain View office. I'm sure you remember that late night conversation.

If you can help here Ceej, that'd be great. I'm just trying to make sure we have the cleanest project possible out there on the web. One wart less will bring me to 999,999 or so. :)
Comment by Trond Norbye [ 15/Sep/14 ]
Just a FYI, we've been ramping up the changes to memcached, so it's no longer a slow moving component ;-)
Comment by Matt Ingenthron [ 15/Sep/14 ]
Slow moving w.r.t. 3.0.0 though, right? That means the current github.com/couchbase/memcached probably has the commit planned to be released, so it's low risk to update github.com/couchbase/manifest with the couchbase repo instead of membase.

That's all I meant. :)
Comment by Trond Norbye [ 15/Sep/14 ]
_all_ components should be slow moving with respect to 3.0.0 ;)
Comment by Chris Hillery [ 16/Sep/14 ]
Matt, it appears that couchbase/memcached is a *fork* of membase/memcached, which is probably undesirable. We can actively rename the membase/memcached project to couchbase/memcached, and github will automatically forward requests from the old name to the new so it is seamless. It also means that we don't have to worry about migrating any commits, etc.

Does anything refer to couchbase/memcached already? Could we delete that one outright and then rename membase/memcached instead?
Comment by Matt Ingenthron [ 16/Sep/14 ]
Ah, that would be my fault. I propose deleting the couchbase/memcached and then transferring ownership from membase/memcached to couchbase/memcached. I think that's what you meant by "actively rename", right? Sounds like a great plan.

I think that's all in your hands Ceej, but I'd be glad to help if needed.

I still think in the interest of reducing warts, it'd be good to fix the manifest.
Comment by Chris Hillery [ 16/Sep/14 ]
I will do that (rename the repo), just please confirm explicitly that temporarily deleting couchbase/memcached won't cause the world to end. :)
Comment by Matt Ingenthron [ 16/Sep/14 ]
It won't since it didn't exist until this last Sunday when I created this ticket. If something world-ending happens as a result, I'll call it a bug to have depended on it. ;)
Comment by Chris Hillery [ 18/Sep/14 ]
I deleted couchbase/memcached and then transferred ownership of membase/memcached to couchbase. The original membase/memcached repository had a number of collaborators, most of which I think were historical. For now, couchbase/memcached only has "Owners" and "Robots" listed as collaborators, which is generally the desired configuration.

http://review.couchbase.org/#/c/41470/ proposes changes to the active manifests. I see no problem with committing that.

As for the historical manifests, there are two:

1. Sooner or later we will add a "released/3.0.0.xml" manifest to the couchbase/manifest repository, representing the exact SHAs which were built. I think it's probably OK to retroactively change the remote on that manifest since the two repositories are aliases for each other. This will affect any 3.0.0 hotfixes which are built, etc.

2. However, all of the already-built 3.0 packages (.deb / .rpm / .zip files) have embedded in them the manifest which was used to build them. Those, unfortunately, cannot be changed at this time. Doing so would require re-packaging the deliverables which have already undergone QE validation. While it is technically possible to do so, it would be a great deal of manual work, and IMHO a non-trivial and unnecessary risk. The only safe solution would be to trigger a new build, but in that case I would argue we would need to re-validate the deliverables, which I'm sure is a non-starter for PM. I'm afraid this particular sub-wart will need to wait for 3.0.1 to be fully addressed.
Comment by Matt Ingenthron [ 18/Sep/14 ]
Excellent, thanks Ceej. I think this is a great improvement-- especially if 3.0.0's release manifest no longer references membase.

I'll leave it to the build team to manage, but I might suggest that gerrit and various other things pointing to membase should slowly change as well, in case someone decides someday to cancel the membase organization subscription to github.




[MB-4593] Windows Installer hangs on "Computing Space Requirements" Created: 27/Dec/11  Updated: 18/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.0-developer-preview-3, 2.0-developer-preview-4
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Bin Cui Assignee: Don Pinto
Resolution: Unresolved Votes: 3
Labels: windows, windows-3.0-beta, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7 Ultimate 64. Sony Vaio, i3 with 4GB RAM and 200 GB of 500 GB free. Also on a Sony Vaio, Windows 7 Ultimate 64, i7, 6 GB RAM and a 750GB drive with about 600 GB free.

Attachments: PNG File couchbase-installer.png     PNG File image001.png     PNG File ss 2014-08-28 at 4.16.09 PM.png    
Triage: Triaged

 Description   
When installing the Community Server 2.0 DP3 on Windows, the installer hangs on the "Computing space requirements screen." There is no additional feedback from the installer. After 90-120 minutes or so, it does move forward and complete. The same issue was reported on Google Groups a few months back - http://groups.google.com/group/couchbase/browse_thread/thread/37dbba592a9c150b/f5e6d80880f7afc8?lnk=gst&q=msi.

Executable: couchbase-server-community_x86_64_2.0.0-dev-preview-3.setup.exe

WORKAROUND IN 3.0 - Create a registry key HKLM\SOFTWARE\Couchbase, name=SkipVcRuntime, type=DWORD, value=1 to skip installing VC redistributable installation which is causing this issue. If VC redistributable is necessary, it must be installed manually if the registry key is set to skip automatic install of it.
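
A minimal sketch of creating that key from Python 2's standard library (_winreg), run from an elevated prompt; note, per the comments below, that the installer on 64-bit Windows reads the key under the Wow6432Node path, which is where a 32-bit process is redirected automatically:

```
import _winreg

# Create HKLM\SOFTWARE\Couchbase and set SkipVcRuntime (DWORD) = 1 so the
# installer skips the VC redistributable step. A 32-bit process on 64-bit
# Windows is transparently redirected to SOFTWARE\Wow6432Node\Couchbase.
key = _winreg.CreateKey(_winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\Couchbase")
_winreg.SetValueEx(key, "SkipVcRuntime", 0, _winreg.REG_DWORD, 1)
_winreg.CloseKey(key)
```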


 Comments   
Comment by Filip Stas [ 23/Feb/12 ]
Is there any solution for this? I'm experiencing the same problem. Running the unpacked msi does not seem to work because the Installshield setup has been configured to require installation through the exe.

Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
from Bin:

Looks like it is related to installshield engine. Maybe installshield tries to access system registry and it is locked by other process. The suggestion is to shut down other running programs and try again if such problem pops up.
Comment by Farshid Ghods (Inactive) [ 22/Mar/12 ]
we were unable to reproduce this on windows 2008 64-bit

the bug mentions this happened on windows 7 64-bit which is not a supported platform but that should not make any difference
Comment by Farshid Ghods (Inactive) [ 23/Mar/12 ]
From Bin:

Windows 7 is my dev environment, and I have no problem installing and testing it. From your description, I cannot tell whether it failed during the installation, or whether the installation finishes but Couchbase server cannot start.
 
If it is due to installshield failure, you can generate the log file for debugging as:
setup.exe /debuglog"C:\PathToLog\setupexe.log"
 
If Couchbase server fails to start, the most likely reason is a missing or incompatible Microsoft runtime library. You can manually run service_start.bat under the bin directory and check what is going on. And you can run cbbrowse_log.bat to generate a log file for further debugging.
Comment by John Zablocki (Inactive) [ 23/Mar/12 ]
This is an installation only problem. There's not much more to it other than the installer hangs on the screen (see attachment).

However, after a failed install, I did get it to work by:

a) deleting C:\Program Files\Couchbase\*

b) deleting all registry keys with Couchbase Server left over from the failed install

c) rebooting

Next time I see this problem, I'll run it again with the /debuglog

I think the problem might be that a previous install of DP3 or DP4 (nightly build) failed and left some bits in place somewhere.
Comment by Steve Yen [ 05/Apr/12 ]
from Perry...
Comment by Thuan Nguyen [ 05/Apr/12 ]
I cannot repro this bug. I tested on Windows 7 Professional 64-bit and Windows Server 2008 64-bit.
Here are steps:
- Install couchbase server 2.0.0r-388 (dp3)
- Open web browser and go to initial setup in web console.
- Uninstall couchbase server 2.0.0r-388
- Install couchbase server 2.0.0dp4r-722
- Open web browser and go to initial setup in web console.
Install and uninstall of couchbase server went smoothly without any problem.
Comment by Bin Cui [ 25/Apr/12 ]
Maybe we need to get the installer verbose log file to get some clues.

setup.exe /verbose"c:\temp\logfile.txt"
Comment by John Zablocki (Inactive) [ 06/Jul/12 ]
Not sure if this is useful or not, but without fail, every time I encounter this problem, simply shutting down apps (usually Chrome for some reason) causes the hanging to stop. Right after closing Chrome, the C++ redistributable dialog pops open and installation completes.
Comment by Matt Ingenthron [ 10/Jul/12 ]
Workarounds/troubleshooting for this issue:


On installshield's website, there are similar problems reported for installshield. There are several possible reasons behind it:

1. The installation of the Microsoft C++ redistributable is blocked by some other running program, sometimes Chrome.
2. There are some remote network drives that are mapped to local system. Installshield may not have enough network privileges to access them.
3. Couchbase server was installed on the machine before and it was not totally uninstalled and/or removed. Installshield tried to recover from those old images.

To determine where to go next, run setup with debugging mode enabled:
setup.exe /debuglog"C:\temp\setupexe.log"

The contents of the log will tell you where it's getting stuck.
Comment by Bin Cui [ 30/Jul/12 ]
Matt's explanation should be included in the documentation and on the Q&A website. I reproduced the hanging problem during installation when the Chrome browser is running.
Comment by Farshid Ghods (Inactive) [ 30/Jul/12 ]
so does that mean the installer should wait until chrome and other browsers are terminated before proceeding?

i see this as a very common pattern with many installers: they ask the user to stop those applications, and if the user does not follow the instructions the setup process does not continue until these conditions are met.
Comment by Dipti Borkar [ 31/Jul/12 ]
Is there no way to fix this? At the least we need to provide an error or guidance that chrome needs to be quit before continuing. Is chrome the only one we have seen causing this problem?
Comment by Steve Yen [ 13/Sep/12 ]
http://review.couchbase.org/#/c/20552/
Comment by Steve Yen [ 13/Sep/12 ]
See CBD-593
Comment by Øyvind Størkersen [ 17/Dec/12 ]
Same bug when installing 2.0.0 (build-1976) on Windows 7. Stopping Chrome did not help, but killing the process "Logitech ScrollApp" (KhalScroll.exe) did..
Comment by Joseph Lam [ 13/Sep/13 ]
It's happening to me when installing 2.1.1 on Windows 7. What is this step for, and is it really necessary? I see that it happens after the files have been copied to the installation folder. Not entirely sure what it's computing space requirements for.
Comment by MikeOliverAZ [ 16/Nov/13 ]
Same problem on 2.2.0 x86_64. I have tried everything, closing down chrome and torch from Task Manager to ensure no other apps are competing. Tried removing registry entries, but there are so many; my time, please. As is noted above this doesn't seem to be preventing writing the files under Program Files, so what's it doing? So I cannot install; it now complains it cannot upgrade and to run the installer again.

BS....giving up and going to MongoDB....it installs no sweat.

Comment by Sriram Melkote [ 18/Nov/13 ]
Reopening. Testing on VMs is a problem because they are all clones. We miss many problems like these.
Comment by Sriram Melkote [ 18/Nov/13 ]
Please don't close this bug until we have clear understanding of:

(a) What is the Runtime Library that we're trying to install that conflicts with all these other apps
(b) Why we need it
(c) A prioritized task to someone to remove that dependency on 3.0 release requirements

Until we have these, please do not close the bug.

We should not do any fixes on the lines of checking for known apps that conflict etc, as that is treating the symptom and not fixing the cause.
Comment by Bin Cui [ 18/Nov/13 ]
We install the Windows runtime library because the erlang runtime libraries depend on it. Not just any runtime library, but the one that comes with the erlang distribution package. Without it, or with an incompatible version, erl.exe won't run.

Instead of checking for any particular applications, the current solution is:
Run an erlang test script. If it runs correctly, no runtime library needs to be installed. Otherwise, the installer has to install the runtime library.

Please see CBD-593.

Comment by Sriram Melkote [ 18/Nov/13 ]
My suggestion is that we not attempt to install MSVCRT ourselves.

Let us check whether the library we need is present prior to starting the install (via appropriate registry keys).

If it is absent, let us direct the user to download and install it and exit.
Comment by Bin Cui [ 18/Nov/13 ]
The approach is not totally right. Even if msvcrt exists, we still need to install it. The key here is the exact same msvcrt package that comes with the erlang distribution. We had problems before where, with the same version but a different build of msvcrt installed, erlang won't run.

One possible solution is to ask the user to download the msvcrt library from our website and make it a prerequisite for installing couchbase server.
Comment by Sriram Melkote [ 18/Nov/13 ]
OK. It looks like MS distributes some versions of VC runtime with the OS itself. I doubt that Erlang needs anything newer.

So let us rebuild Erlang and have it link to the OS supplied version of MSVCRT (i.e., msvcr70.dll) in Couchbase 3.0 onwards

In the meanwhile, let us point the user to the vcredist we ship in Couchbase 2.x versions and ask them to install it from there.
Comment by Steve Yen [ 23/Dec/13 ]
Saw this in the email inboxes...

From: Tal V
Date: December 22, 2013 at 1:19:36 AM PST
Subject: Installing Couchbase on Windows 7

Hi CouchBase support,
I would like to get your assist on an issue I’m having. I have a windows 7 machine on which I tried to install Couchbase, the installation is stuck on the “Computing space requirements”.
I tried several things without success:

1. I tried to download a new installation package.

2. I deleted all records of the software from the Registry.

3. I deleted the folder that was created under C:\Program Files\Couchbase

4. I restart the computer.

5. Opened only the installation package.

6. Re-install it again.
And again it was stuck on the same step.
What is the solution for it?

Thank you very much,


--
Tal V
Comment by Steve Yen [ 23/Dec/13 ]
Hi Bin,
Not knowing much about installshield here, but one idea - are there ways of forcibly, perhaps optionally, skipping the computing space requirements step? Some environment variable flag, perhaps?
Thanks,
Steve

Comment by Bin Cui [ 23/Dec/13 ]
This "Computing space requirements" message is quite misleading. It happens at the post-install step, while the GUI still shows that message. Within that step, we run the erlang test script; if it fails, the installer runs "vcredist.exe" for the microsoft runtime library, which gets stuck.

For the time being, the most reliable way is not to run this vcredist.exe from the installer. Instead, we should provide a link on our download web site.

1. During installation, if we fail to run the erlang test script, we can pop up a warning dialog and ask customers to download and run it after installation.
 
Comment by Bin Cui [ 23/Dec/13 ]
To work around the problem, we can instruct the customer to download the vcredist.exe and run it manually before setting up couchbase server. If the runtime environment is set up correctly, the installer will bypass that step.
Comment by Bin Cui [ 30/Dec/13 ]
Use windows registry key to install/skip the vcredist.exe step:

On 32bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Couchbase\SkipVcRuntime
On 64bit windows, Installer will check HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Couchbase\SkipVcRuntime,
where SkipVcRuntime is a DWORD (32-bit) value.

When SkipVcRuntime is set to 1, installer will skip the step to install vcredist.exe. Otherwise, installer will follow the same logic as before.
vcredist_x86.exe can be found in the root directory of couchbase server. It can be run as:
c:\<couchbase_root>\vcredist_x86.exe

http://review.couchbase.org/#/c/31501/
Comment by Bin Cui [ 02/Jan/14 ]
Check into branch 2.5 http://review.couchbase.org/#/c/31558/
Comment by Iryna Mironava [ 22/Jan/14 ]
tested with Win 7 and Win Server 2008
I am unable to reproduce this issue(build 2.0.0-1976, dp3 is no longer available)
Installed/uninstalled couchbase several times
Comment by Sriram Melkote [ 22/Jan/14 ]
Unfortunately, for this problem, if it did not reproduce, we can't say it is fixed. We have to find a machine where it reproduces and then verify a fix.

Anyway, no change made actually addresses the underlying problem (the registry key just gives a way to work around it when it happens), so reopening the bug and targeting for 3.0
Comment by Sriram Melkote [ 23/Jan/14 ]
Bin - I just noticed that the Erlang installer itself (when downloaded from their website) installs the VC redistributable in non-silent mode. The Microsoft runtime installer dialog pops up, indicates it will install the VC redistributable and then completes. Why do we run it in silent mode (and hence assume liability for it running properly)? Why do we not run the MSI in interactive mode like the ESL Erlang installer itself does?
Comment by Wayne Siu [ 05/Feb/14 ]
If we could get the information on the exact software version, it could be helpful.
From registry, Computer\HKLM\Software\Microsoft\WindowsNT\CurrentVersion
Comment by Wayne Siu [ 12/Feb/14 ]
Bin, looks like the erl.ini was locked when this issue happened.
Comment by Pavel Paulau [ 19/Feb/14 ]
Just happened to me in 2.2.0-837.
Comment by Anil Kumar [ 18/Mar/14 ]
Triaged by Don and Anil as per Windows Developer plan.
Comment by Bin Cui [ 08/Apr/14 ]
http://review.couchbase.org/#/c/35463/
Comment by Chris Hillery [ 13/May/14 ]
I'm new here, but it seems to me that vcredist_x64.exe does exactly the same thing as the corresponding MS-provided merge module for MSVC2013. If that's true, we should be able to just include that merge module in our project, and not need to fork out to install things. In fact, as of a few weeks ago, the 3.0 server installers are doing just that.

http://msdn.microsoft.com/en-us/library/dn501987.aspx

Is my understanding incomplete in some way?
Comment by Chris Hillery [ 14/May/14 ]
I can confirm that the most recent installers do install msvcr120.dll and msvcp120.dll in apparently the correct places, and the server can start with them. I *believe* this means that we no longer need to fork out vcredist_x64.exe, or have any of the InstallShield tricks to detect whether it is needed and/or skip installing it, etc. I'm leaving this bug open to both verify that the current merge module-based solution works, and to track removal of the unwanted code.
Comment by Sriram Melkote [ 16/May/14 ]
I've also verified that the VCRT installed by the 3.0 build (msvcp100) is sufficient for Erlang R16.
Comment by Bin Cui [ 15/Sep/14 ]
Recently I happened to reproduce this problem on my own laptop. Using setup.exe /verbose"c:\temp\verbose.log", I generated a log file with more verbose debugging information. At the end of the file, it looks something like:

MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: OVERVIEW.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_admin\overview\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: BUCKETS.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_admin\buckets\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: MN_DIALOGS.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_dialogs\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: ABOUT.09DE5D66_88FD_4345_97EE_506873561EC1 , Object: C:\t5\lib\ns_server\priv\public\angular\app\mn_dialogs\about\
MSI (c) (C4:C0) [10:51:36:274]: Dir (target): Key: ALLUSERSPROFILE , Object: Q:\
MSI (c) (C4:C0) [10:51:36:274]: PROPERTY CHANGE: Adding INSTALLLEVEL property. Its value is '1'.

It means that the installer tried to populate some property values for the all-users profile after it copied all data to the install location, even though it was still showing the notorious "Computing space requirements" message.

For every installation, the installer uses the user temp directory to populate installer-related data. After I deleted or renamed the temp data under c:\Users\<logonuser>\AppData\Temp and rebooted the machine, the problem was solved, at least for my laptop.

Conclusion:

1. After the installer has copied the files, it needs to set the all-users profiles. This action is synchronous; it waits and checks the exit code. And certainly it will hang if this action never returns.

2. This is an issue related to the setup environment, i.e. caused by other running applications, etc.

Suggestion:

1. Stop any other browsers and applications when you install couchbase.
2. Kill the installation process and uninstall the failed setup.
3. Delete/rename the temp location under c:\Users\<logonuser>\AppData\Temp
4. Reboot and try again.

Comment by Bin Cui [ 17/Sep/14 ]
Turns out it is really about the installation environment, not about a particular installation step.

I suggest documenting the workaround method.
Comment by Don Pinto [ 17/Sep/14 ]
Bin, some installers kill conflicting processes before installation starts so that it can complete. Why can't we do this?

(Maybe using something like this - http://stackoverflow.com/questions/251218/how-to-stop-a-running-process-during-an-msi-based-un-install)

Thanks,
Don




[MB-12126] there is no manifest file on windows 3.0.1-1253 Created: 03/Sep/14  Updated: 18/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 r2 64-bit

Attachments: PNG File ss 2014-09-03 at 12.05.41 PM.png    
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.1-1253 on windows server 2008 r2 64-bit. There is no manifest file in the directory c:\Program Files\Couchbase\Server\



 Comments   
Comment by Chris Hillery [ 03/Sep/14 ]
Also true for 3.0 RC2 build 1205.
Comment by Chris Hillery [ 03/Sep/14 ]
(Side note: While fixing this, log onto build slaves and delete stale "server-overlay/licenses.tgz" file so we stop shipping that)
Comment by Anil Kumar [ 17/Sep/14 ]
Ceej - Any update on this?
Comment by Chris Hillery [ 18/Sep/14 ]
No, not yet.




[MB-9897] Implement upr cursor dropping Created: 13/Jan/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Mike Wiederhold Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Chiyoung Seo [ 17/Sep/14 ]
This requires some significant changes in DCP and checkpointing in ep-engine. Moving this to post 3.0.1




[MB-12084] Create 3.0.0 chef-based rightscale template for EE and CE Created: 27/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Need this before 3.0 GA




[MB-12083] Create 3.0.0 legacy rightscale templates for Enterprise and Community Edition (non-chef) Created: 27/Aug/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We need this before 3.0 GA




[MB-12054] [windows] [2.5.1] cluster hang when flush beer-sample bucket Created: 22/Aug/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Abhinav Dangeti
Resolution: Cannot Reproduce Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows server 2008 R2

Attachments: Zip Archive 172.23.107.124-8222014-1546-diag.zip     Zip Archive 172.23.107.125-8222014-1547-diag.zip     Zip Archive 172.23.107.126-8222014-1548-diag.zip     Zip Archive 172.23.107.127-8222014-1549-diag.zip    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 2.5.1 on 4 nodes windows server 2008 R2 64-bit
Create a cluster of 4 nodes
Create beer-sample bucket
Enable flush in bucket setting.
Flush the beer-sample bucket. The cluster hangs.

 Comments   
Comment by Abhinav Dangeti [ 11/Sep/14 ]
I wasn't able to reproduce this issue with a 2.5.1 build with 2 nodes.

From your logs on one of the nodes I see some couchNotifier logs, where we are waiting for mcCouch:
..
Fri Aug 22 14:00:03.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 14:21:53.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 14:43:43.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
Fri Aug 22 15:05:33.011000 Pacific Daylight Time 3: (beer-sample) Failed to send all data. Wait a while until mccouch is ready to receive more data, sent 0 remains = 56
...

This won't be a problem in 3.0.1, as mcCouch has been removed. Please re-open if you see this issue in your testing again.




[MB-11426] API for compact-in-place operation Created: 13/Jun/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Jens Alfke Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be convenient to have an explicit API for compacting the database in place, i.e. to the same file. This is what auto-compact does, but if auto-compact isn't enabled, or if the caller wants to run it immediately instead of on a schedule, then the caller has to use fdb_compact, which compacts to a separate file.

I assume the workaround is to compact to a temporary file, then replace the original file with the temporary. But this is several more steps. Since forestdb already contains the logic to compact in place, it'd be convenient if calling fdb_compact(handle, NULL) would do that.

 Comments   
Comment by Chiyoung Seo [ 10/Sep/14 ]
The change is in gerrit for review:

http://review.couchbase.org/#/c/41337/
Comment by Jens Alfke [ 10/Sep/14 ]
The notes on Gerrit say "a new file name will be automatically created by appending a file revision number to the original file name. …. Note that this new compacted file can be still opened by using the original file name"

I don't understand what's going on here — after the compaction is complete, does the old file still exist or am I responsible for deleting it? When does the file get renamed back to the original filename, or does it ever? Should my code ignore the fact that the file is now named "test.fdb.173" and always open it as "test.fdb"?
Comment by Chiyoung Seo [ 10/Sep/14 ]
>I don't understand what's going on here — after the compaction is complete, does the old file still exist or am I responsible for deleting it?

The old file is automatically removed by ForestDB after the compaction is completed.

>When does the file get renamed back to the original filename, or does it ever?

The file won't be renamed to the original name in the current implementation. But, I will adapt the current implementation so that when the file is closed and its ref counter becomes zero, the file can be renamed to its original name.

>Should my code ignore the fact that the file is now named "test.fdb.173" and always open it as "test.fdb"?

Yes, you can still open "test.fdb.173" by passing "test.fdb" file name.

Note that renaming it to the original file name right after finishing the compaction becomes complicated, as other threads might traverse the old file's blocks (through the buffer cache or OS page cache).

Comment by Chiyoung Seo [ 11/Sep/14 ]
I incorporated those answers into the commit message and API header file. Let me know if you have any suggestions / concerns.
Comment by Chiyoung Seo [ 12/Sep/14 ]
The change was merged into the master branch.




[MB-12082] Marketplace AMI - Enterprise Edition and Community Edition - provide AMI id to PM Created: 27/Aug/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cloud
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Anil Kumar Assignee: Wei-Li Liu
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Need AMI's before 3.0.0 GA

 Comments   
Comment by Wei-Li Liu [ 17/Sep/14 ]
3.0.0 EE AMI: ami-283a9440 Snapshots: snap-308fc192
3.0.0 CE AMI: ami-3237995a




[MB-12186] If flush can not be completed because of a timeout, we should not display a message "Failed to flush bucket" when it's still in progress Created: 15/Sep/14  Updated: 17/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1208

Attachments: PNG File MB-12186.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When I tried to flush a heavily loaded cluster I received a "Failed To Flush Bucket" popup; in fact it had not failed, but had simply not completed within the set period of time (30 sec)?

expected behaviour: a message like "flush is not complete, but continuing..."

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
A timeout is a timeout. We can say "it timed out" but we cannot be sure if it's continuing or not.
Comment by Andrei Baranouski [ 15/Sep/14 ]
hm, we get a timeout when removing a bucket takes too long as well, but there we inform the user that the removal is still in progress, right?
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
You're right. I don't think we're entirely precise in the bucket deletion timeout message either. It's one of our mid-term goals to do better with these longer-running ops and how their progress or results are exposed to the user. I see not much value in tweaking messages; instead we'll just make this entire thing work "right".




[MB-12202] UI shows a cbrestore as XDCR ops Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Linux 3.13.0-30-generic
[info] OS Version : Ubuntu 14.04 LTS
[info] CB Version : 2.5.1-1083-rel-enterprise

Attachments: PNG File cbrestoreXDCRops.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I noticed while doing a cbrestore of a backup on a cluster that doesn't have any XDCR configured that the stats in the UI showed ongoing ops for XDCR. (screenshot attached)

the stats code at
http://src.couchbase.org/source/xref/2.5.1/ns_server/src/stats_collector.erl#334 is counting all set-with-meta operations as XDCR ops.

 Comments   
Comment by Aleksey Kondratenko [ 17/Sep/14 ]
That's the way it is. We have no way to distinguish sources of set-with-metas.




[MB-12189] (misunderstanding) XDCR REST API "max-concurrency" only works for 1 of 3 documented end-points. Created: 15/Sep/14  Updated: 17/Sep/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server, RESTful-APIs
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Jim Walker Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: supportability, xdcr
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Couchbase Server 2.5.1
RHEL 6.4
VM (VirtualBox0
1 node "cluster"

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
This defect relates to the following REST APIs:

* xdcrMaxConcurrentReps (default 32) http://localhost:8091/internalSettings/
* maxConcurrentReps (default 32) http://localhost:8091/settings/replications/
* maxConcurrentReps (default 32) http://localhost:8091/settings/replications/ <replication_id>

The documentation suggests these all do the same thing, but with the scope of change being different.

<docs>
/settings/replications/ — global settings applied to all replications for a cluster
settings/replications/<replication_id> — settings for specific replication for a bucket
/internalSettings - settings applied to all replications for a cluster. Endpoint exists in Couchbase 2.0 and onward.
</docs>

This defect is because only "settings/replications/<replication_id>" has any effect. The other REST endpoints have no effect.

Out of these APIs I can confirm that changing "/settings/replications/<replication_id>" has an effect. The XDCR code shows that the concurrent reps setting feeds into the concurrency throttle as the number of available tokens. I use the xdcr log files, where we print the concurrency throttle token data, to observe that the setting has an effect.

For example, a cluster in the default configuration has a total of 32 tokens. We can grep to see this.

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.*
2014-09-15T13:09:03.886,ns_1@127.0.0.1:<0.32370.0>:concurrency_throttle:clean_concurr_throttle_state:275]rep <0.33.1> to node "192.168.69.102:8092" is done normally, total tokens: 32, available tokens: 32,(active reps: 0, waiting reps: 0)

Now, changing the setting to 42, the log file shows the change taking effect.

curl -u Administrator:password http://localhost:8091/settings/replications/01d38792865ba2d624edb4b2ad2bf07f%2fdefault%2fdefault -d maxConcurrentReps=42

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.*
dcr.1:[xdcr:debug,2014-09-15T13:17:41.112,ns_1@127.0.0.1:<0.32370.0>:concurrency_throttle:clean_concurr_throttle_state:275]rep <0.2321.1> to node "192.168.69.102:8092" is done normally, total tokens: 42, available tokens: 42,(active reps: 0, waiting reps: 0)

Since this defect is that the other two REST end-points don't appear to have any effect, here's an example changing "settings/replications". This example was on a clean cluster, i.e. no other settings had been changed; only bucket creation, replication creation and client writes had been performed.

root@localhost logs]# curl -u Administrator:password http://localhost:8091/settings/replications/ -d maxConcurrentReps=48
{"maxConcurrentReps":48,"checkpointInterval":1800,"docBatchSizeKb":2048,"failureRestartInterval":30,"workerBatchSize":500,"connectionTimeout":180,"workerProcesses":4,"httpConnections":20,"retriesPerRequest":2,"optimisticReplicationThreshold":256,"socketOptions":{"keepalive":true,"nodelay":false},"supervisorMaxR":25,"supervisorMaxT":5,"traceDumpInvprob":1000}

The output above shows that the JSON response has acknowledged the value of 48, but the log files show no change. After much waiting and re-checking, grep shows no evidence of it.

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.* | grep "total tokens: 48" | wc -l
0
[root@localhost logs]# grep "is done normally, total tokens:" xdcr.* | grep "total tokens: 32" | wc -l
7713

The same was observed for /internalSettings/

Found on both 2.5.1 and 3.0.
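
A small sketch of driving these endpoints from Python 2 (urllib/urllib2, using the same Administrator:password credentials and the URL-encoded replication id from the curl examples above); see the comments below for why only the per-replication endpoint appears to take effect for UI-created replications:

```
import base64
import urllib
import urllib2

BASE = "http://localhost:8091"
AUTH = "Basic " + base64.b64encode("Administrator:password")

def post(path, **params):
    # POST form-encoded settings, like the curl -d examples above.
    req = urllib2.Request(BASE + path, urllib.urlencode(params),
                          {"Authorization": AUTH})
    return urllib2.urlopen(req).read()

# Global default: only acts on replications without per-replication settings.
post("/settings/replications/", maxConcurrentReps=48)

# Per-replication setting: this is what an existing (UI-created) replication
# actually uses, since the UI defines all per-replication settings.
repl_id = "01d38792865ba2d624edb4b2ad2bf07f%2fdefault%2fdefault"
post("/settings/replications/" + repl_id, maxConcurrentReps=42)
```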

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
This is because global settings affect new replications or replications without per-replication settings defined. UI always defines all per-replication settings.
Comment by Jim Walker [ 16/Sep/14 ]
Have you pushed a documentation update for this?
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
No. I don't own docs.
Comment by Jim Walker [ 17/Sep/14 ]
Then this issue is not resolved.

Closing/resolving this defect with breadcrumbs to the opening of an issue on a different project would suffice as a satisfactory resolution.

You can also very easily put a pull request into docs on github with the correct behaviour.

Can you please perform *one* of those tasks so that the REST API here is correctly documented with the behaviours you are aware of and this matter can be closed.
Comment by Jim Walker [ 17/Sep/14 ]
Resolution requires either:

* Corrected documentation pushed to documentation repository.
* Enough accurate API information placed into a documentation defect so docs-team can correct.





[MB-11084] Build python snappy module on windows Created: 09/May/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Minor
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


 Description   
To deal with the compressed datatype, we need Python support for snappy. We need to build https://github.com/andrix/python-snappy on windows and make it part of the package.
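
For reference, a minimal round-trip check that the module is usable once built (a sketch; assumes the built python-snappy is importable as `snappy`):

```
import snappy  # python-snappy, built from https://github.com/andrix/python-snappy

# Compress and decompress a small payload to confirm the bindings work.
original = b"x" * 1000
compressed = snappy.compress(original)
assert snappy.uncompress(compressed) == original
print("snappy round-trip OK: %d -> %d bytes" % (len(original), len(compressed)))
```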

 Comments   
Comment by Bin Cui [ 09/May/14 ]
I implemented the related logic for centos 5.x, 6.x and ubuntu. Please look at http://review.couchbase.org/#/c/36902/
Comment by Trond Norbye [ 16/Jun/14 ]
I've updated the windows build depot with the modules built for Python 2.7.6.

Please populate the depot to the builder and reassign the bug to Bin for verification.
Comment by Chris Hillery [ 13/Aug/14 ]
Depot was updated yesterday, so pysnappy is expanded into the install directory before the Couchbase build is started. I'm not sure what needs to be done to then use this package; passing off to Bin.
Comment by Don Pinto [ 03/Sep/14 ]
Question : Given that compressed datatype is not in 3.0 - is this still a requirement?

Thanks,




[MB-8508] installer - windows packages should be signed Created: 26/Nov/12  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0, 2.1.0, 2.2.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Task Priority: Critical
Reporter: Steve Yen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-5577 print out Couchbase in the warning sc... Open
relates to MB-9165 Windows 8 Smartscreen blocks Couchbas... Resolved

 Description   
see also: http://www.couchbase.com/issues/browse/MB-7250
see also: http://www.couchbase.com/issues/browse/MB-49


 Comments   
Comment by Steve Yen [ 10/Dec/12 ]
Part of the challenge here would be figuring out the key-ownership process. Perhaps PM's should go create, register and own the signing keys/certs.
Comment by Steve Yen [ 31/Jan/13 ]
Reassigning as I think Phil has been tracking down the keys to the company.
Comment by Phil Labee [ 01/May/13 ]
Need more information:

Why do we need to sign windows app?
What problems are we addressing?
Do you want to release through the Windows Store?
What versions of Windows do we need to support?
Comment by Phil Labee [ 01/May/13 ]
need to know what problem we're trying to solve
Comment by Wayne Siu [ 06/Sep/13 ]
No security warning box is the objective.
Comment by Wayne Siu [ 20/Jun/14 ]
Anil,
I assume this is out of 3.0. Please update if it's not.
Comment by Anil Kumar [ 20/Jun/14 ]
We should still consider it for 3.0; if there is no time to fix it, then it is a candidate for punting.
Comment by Wayne Siu [ 30/Jul/14 ]
Moving it out of 3.0.
Comment by Anil Kumar [ 17/Sep/14 ]
We need this for the Windows 3.0 GA timeframe.




[MB-9825] Rebalance exited with reason bad_replicas Created: 06/Jan/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Venu Uppalapati
Resolution: Unresolved Votes: 0
Labels: performance, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.5.0 enterprise edition (build-1015)

Platform = Physical
OS = Windows Server 2012
CPU = Intel Xeon E5-2630
Memory = 64 GB
Disk = 2 x HDD

Triage: Triaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/zeus-64/564/artifact/

 Description   
Rebalance-out, 4 -> 3, 1 bucket x 50M x 2KB, DGM, 1 x 1 views

Bad replicators after rebalance:
Missing = [{'ns_1@172.23.96.27','ns_1@172.23.96.26',597},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',598},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',599},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',600},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',601},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',602},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',603},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',604},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',605},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',606},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',607},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',608},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',609},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',610},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',611},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',612},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',613},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',614},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',615},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',616},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',617},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',618},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',619},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',620},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',621},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',622},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',623},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',624},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',625},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',626},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',627},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',628},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',629},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',630},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',631},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',632},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',633},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',634},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',635},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',636},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',637},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',638},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',639},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',640},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',641},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',642},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',643},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',644},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',645},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',646},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',647},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',648},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',649},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',650},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',651},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',652},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',653},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',654},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',655},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',656},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',657},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',658},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',659},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',660},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',661},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',662},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',663},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',664},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',665},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',666},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',667},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',668},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',669},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',670},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',671},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',672},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',673},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',674},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',675},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',676},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',677},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',678},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',679},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',680},
{'ns_1@172.23.96.27','ns_1@172.23.96.26',681}]
Extras = []

 Comments   
Comment by Aleksey Kondratenko [ 06/Jan/14 ]
Looks like the producer node simply closed the socket.

Most likely a duplicate of the old issue where both socket sides suddenly see the connection as closed.

Relevant log messages:

[error_logger:info,2014-01-06T10:30:00.231,ns_1@172.23.96.26:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
          supervisor: {local,'ns_vbm_new_sup-bucket-1'}
             started: [{pid,<0.1169.0>},
                       {name,
                           {new_child_id,
                               [597,598,599,600,601,602,603,604,605,606,607,
                                608,609,610,611,612,613,614,615,616,617,618,
                                619,620,621,622,623,624,625,626,627,628,629,
                                630,631,632,633,634,635,636,637,638,639,640,
                                641,642,643,644,645,646,647,648,649,650,651,
                                652,653,654,655,656,657,658,659,660,661,662,
                                663,664,665,666,667,668,669,670,671,672,673,
                                674,675,676,677,678,679,680,681],
                               'ns_1@172.23.96.27'}},
                       {mfargs,
                           {ebucketmigrator_srv,start_link,
                               [{"172.23.96.27",11209},
                                {"172.23.96.26",11209},
                                [{on_not_ready_vbuckets,
                                     #Fun<tap_replication_manager.2.133536719>},
                                 {username,"bucket-1"},
                                 {password,get_from_config},
                                 {vbuckets,
                                     [597,598,599,600,601,602,603,604,605,606,
                                      607,608,609,610,611,612,613,614,615,616,
                                      617,618,619,620,621,622,623,624,625,626,
                                      627,628,629,630,631,632,633,634,635,636,
                                      637,638,639,640,641,642,643,644,645,646,
                                      647,648,649,650,651,652,653,654,655,656,
                                      657,658,659,660,661,662,663,664,665,666,
                                      667,668,669,670,671,672,673,674,675,676,
                                      677,678,679,680,681]},
                                 {set_to_pending_state,false},
                                 {takeover,false},
                                 {suffix,"ns_1@172.23.96.26"}]]}},
                       {restart_type,temporary},
                       {shutdown,60000},
                       {child_type,worker}]



[rebalance:debug,2014-01-06T12:12:33.870,ns_1@172.23.96.26:<0.1169.0>:ebucketmigrator_srv:terminate:737]Dying with reason: normal

Mon Jan 06 12:12:44.371917 Pacific Standard Time 3: (bucket-1) TAP (Producer) eq_tapq:replication_ns_1@172.23.96.26 - disconnected, keep alive for 300 seconds
Comment by Maria McDuff (Inactive) [ 10/Jan/14 ]
Looks like a dupe of the memcached connection issue.
Will close this as a dupe.
Comment by Wayne Siu [ 15/Jan/14 ]
Chiyoung to add more debug logging to 2.5.1.
Comment by Chiyoung Seo [ 17/Jan/14 ]
I added more warning-level logs for disconnection events in the memcached layer. We will continue to investigate this issue for the 2.5.1 or 3.0 release.

http://review.couchbase.org/#/c/32567/

merged.
Comment by Cihan Biyikoglu [ 08/Apr/14 ]
Given we have more verbose logging, can we reproduce the issue again and see if we can get a better idea on where the problem is?
thanks
Comment by Pavel Paulau [ 08/Apr/14 ]
This issue has happened only on Windows so far.
I wasn't able to reproduce it in 2.5.1, and obviously we haven't tested 3.0 yet.
Comment by Cihan Biyikoglu [ 25/Jun/14 ]
Pavel, do you have the repro with the detailed logs now? if yes, could we assign to a dev for fixing?
Comment by Pavel Paulau [ 25/Jun/14 ]
This is a Windows-specific bug. We are not testing Windows yet.
Comment by Pavel Paulau [ 27/Jun/14 ]
Just FYI.

I have finally tried the Windows build. It's absolutely unstable and not ready for performance testing yet.
Please don't expect news any time soon.




[MB-9874] [Windows] Couchstore drop and reopen of file handle fails Created: 09/Jan/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: windows, windows_pm_triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


 Description   
The unit test doing couchstore_drop_file and couchstore_reopen_file fails due to COUCHSTORE_READ_ERROR when it tries to reopen the file.

The commit http://review.couchbase.org/#/c/31767/ disabled the test to allow the rest of the unit tests to be executed.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-9635] Audit logs for Admin actions Created: 22/Nov/13  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Anil Kumar Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate

 Description   
Couchbase Server should be able to produce an audit log of all Admin actions, such as login/logout events, significant events (rebalance, failover, etc.), and so on.



 Comments   
Comment by Matt Ingenthron [ 13/Mar/14 ]
Note there isn't exactly a "login/logout" event. This is mostly by design. A feature like this could be added, but there may be better ways to achieve the underlying requirement. One suggestion would be to log initial activities instead of every activity, keeping a 'cache' of user agents seen within a particular window (see the sketch below). That would probably meet most auditing requirements and is, I think, relatively straightforward to implement.
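A minimal sketch of that suggestion, with hypothetical names and an arbitrary window length (audit only the first activity from a given user/user-agent pair inside a time window):

# Illustrative only: a time-windowed "seen" cache so that only the initial
# activity of a (user, user agent) pair is written to the audit log.
import time

WINDOW_SECONDS = 15 * 60          # window length is an arbitrary example
_seen = {}                        # (user, user_agent) -> last audited time

def should_audit(user, user_agent, now=None):
    now = time.time() if now is None else now
    key = (user, user_agent)
    last = _seen.get(key)
    if last is None or now - last > WINDOW_SECONDS:
        _seen[key] = now
        return True               # first activity in the window: audit it
    return False                  # already seen recently: skip

# Example: the first request is audited, an immediate repeat is not.
assert should_audit("Administrator", "curl/7.30") is True
assert should_audit("Administrator", "curl/7.30") is False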
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
We have access.log implemented now, but it's not exactly the same as a full-blown audit. In particular, we log that a certain POST was handled in access.log, but we do not log any parameters of that action. So I don't think it counts as a fully-featured audit log.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
Our access.log and ep-engine's access.log do not conflict, because they necessarily live in different directories.
Comment by Perry Krug [ 06/Jun/14 ]
They may not conflict in terms of unique names in the same directory, but it may be a little too hard for our customers to remember which access.log does what...
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
Ok. Any specific proposals ?
Comment by Perry Krug [ 06/Jun/14 ]
Yes, as mentioned above, login.log would be one proposal but I'm not tied to it.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
access.log has very little to do with logins. It's a full-blown equivalent of Apache's access.log.
Comment by Perry Krug [ 06/Jun/14 ]
Oh sorry, I misread this specific section.

How about audit.log? I know it's not fully "audit", but I'm just trying to avoid the name clash in our customers' minds...
Comment by Anil Kumar [ 09/Jun/14 ]
Agreed we should rename this file to audit.log to avoid any confusion. Updating the MB-10020 to make that change.
Comment by Larry Liu [ 10/Jun/14 ]
Hi, Anil

Does this feature satisfy PCI compliance?

Larry
Comment by Cihan Biyikoglu [ 11/Jun/14 ]
Hi Larry, PCI is a comprehensive set of requirements that go beyond database features. This does help with some parts of PCI, but PCI compliance involves many additional controls, most of which can be addressed at the operational or application level.
thanks




[MB-12200] Seg fault during indexing on view-toy build testing Created: 16/Sep/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Harsha Havanur
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: -3.0.0-700-hhs-toy
-Cen 64 Machines
- 7 Node cluster, 2 Buckets, 2 Views

Attachments: Zip Archive 10.6.2.168-9162014-106-diag.zip     Zip Archive 10.6.2.187-9162014-1010-diag.zip     File crash_beam.smp.rtf     File crash_toybuild.rtf    
Issue Links:
Duplicate
is duplicated by MB-11917 One node slow probably due to the Erl... Open
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
1. Load 70M, 100M on either bucket
2. Wait for initial indexing to complete
3. Start updates on the cluster 1K gets, 7K sets across the cluster

Seeing numerous cores from beam.smp.

Stack is attached.

Adding logs from the nodes.


 Comments   
Comment by Sriram Melkote [ 16/Sep/14 ]
Harsha, this clearly appears to be a NIF-related regression. We need to discuss why our own testing didn't find this after you figure out the problem.
Comment by Volker Mische [ 16/Sep/14 ]
Siri, I haven't checked if it's the same issue, but the current patch doesn't pass our unit tests. See my comment at http://review.couchbase.org/41221
Comment by Ketaki Gangal [ 16/Sep/14 ]
Logs https://s3.amazonaws.com/bugdb/bug-12200/bug_12200.tar
Comment by Harsha Havanur [ 17/Sep/14 ]
The issue Volker mentioned is one of queue size. I suspect that if a context sits in the queue for more than 5 seconds, the terminator loop destroys it, and when the doMapDoc loop later dequeues the task it hits a SEGV because the ctx has already been destroyed. Trying a fix that both increases the queue size and handles destroyed contexts (see the sketch below).
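An abstract sketch of the suspected race and the guard being proposed (illustrative Python only, not the actual NIF code; all names here are hypothetical):

# Illustrative only: a terminator loop destroys contexts that waited in the
# queue too long, so the map worker must re-check the context state under a
# lock before touching it, instead of assuming it is still alive.
import threading, time

QUEUE_TIMEOUT = 5.0

class MapContext(object):
    def __init__(self):
        self.enqueued_at = time.time()
        self.destroyed = False
        self.lock = threading.Lock()

def terminator_pass(ctx):
    with ctx.lock:
        if not ctx.destroyed and time.time() - ctx.enqueued_at > QUEUE_TIMEOUT:
            ctx.destroyed = True      # the real code frees native resources here

def do_map_doc(ctx, doc):
    with ctx.lock:
        if ctx.destroyed:
            return None               # skip rather than dereference a freed context
        return len(doc)               # stand-in for the real map computation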
Comment by Sriram Melkote [ 17/Sep/14 ]
Folks, let's follow this on MB-11917, as it's clear now that this bug is caused by the toy build as a result of the proposed fix for MB-11917.




[MB-12206] New 3.0 Doc Site, View and query pattern samples unparsed markup Created: 17/Sep/14  Updated: 17/Sep/14  Resolved: 17/Sep/14

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Ruth Harris
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
On the page

http://draft.docs.couchbase.com/prebuilt/couchbase-manual-3.0/Views/views-querySample.html

The view code examples under 'General advice' are not displayed properly.

 Comments   
Comment by Ruth Harris [ 17/Sep/14 ]
Fixed. Legacy formatting issues from previous source code.




[MB-12207] Related links could be clearer. Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Patrick Varley Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I think it would be better if the "Related links" section at the bottom of the page were laid out a little differently, and if we added the ability to navigate (MB-12205) from the bottom of a page (think long pages).

Maybe something like this could work:

Links

Parent Topic:
    Installation and upgrade
Previous Topic:
    Welcome to couchbase
Next Topic:
    uninstalling couchbase
Related Topics:
    Initial server setup
    Testing Couchbase Server
    Upgrading




[MB-12195] Update notifications does not seem to be working Created: 15/Sep/14  Updated: 17/Sep/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Ian McCloy
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Centos 5.8
2.5.0

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I have installed the 2.5.0 build and enabled Update Notifications.
Even though I enabled "Enable software Update Notifications", I keep getting "No Updates available".
I thought I would be notified in the UI that 2.5.1 is available.

I consulted Tony to see if I had done something wrong, but he also confirmed that this seems to be an issue and is a bug.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
Based on dev tools, we're getting "no new version" from phone home requests. So it's not a UI bug.
Comment by Ian McCloy [ 17/Sep/14 ]
Added the missing available upgrade paths to the database:

2.5.0-1059-rel-enterprise -> 2.5.1-1083-rel-enterprise
2.2.0-837-rel-enterprise -> 2.5.1-1083-rel-enterprise
2.1.0-718-rel-enterprise -> 2.2.0-837-rel-enterprise

but it looks like the code that parses http://ph.couchbase.net/v2?callback=jQueryxxx isn't checking the database.




[MB-12205] Doc-system: does not have a next page button. Created: 17/Sep/14  Updated: 17/Sep/14

Status: Open