[MB-10960] Add Debian package to 3.0 release - Debian 7.0 Created: 24/Apr/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Sriram Melkote Assignee: Phil Labee
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Duplicate
Relates to

 Description   
Debian is consistently in the top 3 distributions in the server market, by almost every count. For example, the survey below ranks the operating systems of the top 10 million web servers:

http://w3techs.com/technologies/overview/operating_system/all

You can look at various other surveys and you'll see the message is the same: Debian is pretty much at the top for servers. Yet we don't ship packages for it. This is quite hard to understand because we already build a .deb for Ubuntu, and it takes only a few minor changes to make it compatible with Debian stable.
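
For context, a minimal sketch of what reusing the existing Ubuntu package on Debian 7 would look like (the package filename is a placeholder; any missing or mismatched dependencies would surface in the second step):

    # install the Ubuntu-built package on a Debian 7 host (filename is hypothetical)
    sudo dpkg -i couchbase-server-enterprise_x86_64_3.0.0-XXXX.deb
    # then pull in any dependencies that dpkg reported as missing
    sudo apt-get -f install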

While I don't track customer requests, I've anecdotally seen them requesting the exact same thing in unambiguous terms.


 Comments   
Comment by Sriram Melkote [ 14/May/14 ]
Eric, Ubuntu and Debian are different distributions.
Comment by Jacob Lundberg [ 10/Jul/14 ]
Could somebody please add my user (jacoblundberg) to view CBSE-1140 if that is where this work will be done? This is an important request for CollegeNET and I want to be able to view the status.
Comment by Brent Woodruff [ 10/Jul/14 ]
Hi Jacob. I am not familiar with that ticket (CBSE-1140) firsthand, so perhaps someone else would be better for discussing that issue with you. However, I wanted to let you know that all CBSE tickets are internal to Couchbase. We will not be able to grant you access to view that ticket's contents.

MB tickets such as this one are public, and updates to the status of Couchbase's support of the Debian platform for Couchbase Server will appear here.

I recommend that you communicate with Couchbase Support via email or the web if you have a question about the status of work you are expecting to be completed, which has either a related Couchbase Support ticket or a CBSE ticket.

Support email: support@couchbase.com
Support portal: http://support.couchbase.com
Comment by Anil Kumar [ 11/Jul/14 ]
Here are the details for adding Debian support:
- Support the current stable distribution of Debian, which is version 7
- 64-bit only

Comment by Phil Labee [ 14/Jul/14 ]
The status quo is that 2.x does not build on Debian and it does not install cleanly on Debian. This is a new platform for us, which means new build infrastructure and new installer files.

We do currently support Ubuntu, which produces a *.deb file for installation. So what we have may be close, but since this is a new platform it is unclear how much work will be required.
Comment by Sriram Melkote [ 14/Jul/14 ]
OK - I've removed my suggestion to split it into two tasks. Let's treat this as a new platform as suggested.
Comment by Anil Kumar [ 15/Jul/14 ]
Ceej/Phil - As mentioned before, Debian is a new platform that we will start supporting from 3.0.

- Support Debian 7
- 64-bit only

Comment by Phil Labee [ 23/Jul/14 ]
I need 4 VMs with Debian 7.0. Each needs:

    4 GB RAM
    4 CPUs
    100 GB of disk space

These machines may be used for building the server, or for running smoke tests like the ones in CBIT-956.

Please take a snapshot of each after configuring. I'm going to install a build environment but we may need to roll back in case we need to re-purpose any of these hosts.
Comment by Phil Labee [ 23/Jul/14 ]
similar use case




[MB-10371] tcmalloc must be compiled with -DTCMALLOC_SMALL_BUT_SLOW [ 1 ] Created: 05/Mar/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0, 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Phil Labee
Resolution: Unresolved Votes: 0
Labels: 2.5.1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-10440 something isn't right with tcmalloc i... Resolved
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Triaged
Is this a Regression?: Yes

 Description   
Possible candidate for 2.5.1, given it's an easy fix.

This is based on MB-7887, and particularly on runs comparing older 2.2.0 builds with newer 2.2.0 builds (made after the 2.2.0 release, I guess purely for hotfixes or the like), where we see a tcmalloc memory fragmentation regression. For example, here: https://www.couchbase.com/issues/secure/attachment/19549/10KB-1MB_250Items_10KB_delta.log

We see that 2.2.0-821 (GA) reports:

2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] detailed: NOTE: SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER.
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] ------------------------------------------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 645831408 ( 615.9 MiB) Bytes in use by application
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 120201216 ( 114.6 MiB) Bytes in page heap freelist
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 10568336 ( 10.1 MiB) Bytes in central cache freelist
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 0 ( 0.0 MiB) Bytes in transfer cache freelist
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 2810496 ( 2.7 MiB) Bytes in thread cache freelists
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 1831064 ( 1.7 MiB) Bytes in malloc metadata
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: ------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: = 781242520 ( 745.1 MiB) Actual memory used (physical + swap)
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: + 26411008 ( 25.2 MiB) Bytes released to OS (aka unmapped)
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: ------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: = 807653528 ( 770.2 MiB) Virtual address space used
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC:
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 6848 Spans in use
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 18 Thread heaps in use
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] MALLOC: 8192 Tcmalloc page size
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] ------------------------------------------------
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Bytes released to the OS take up virtual address space but no physical memory.

I.e. about 750 MB of RAM is used, while 2.2.0-840 (the later build I referred to above) eats 820 MB. There's a similar situation with 2.5.0 GA.

The only regression of 2.2.0-840 vs. 2.2.0-821 that's apparent here is the lack of -DTCMALLOC_SMALL_BUT_SLOW in the build (which is seen by the absence of the "NOTE: SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER." warning in later builds).

Therefore we should restore that define, which is known to have been passed in earlier builds and was somehow dropped in later ones. It's possible that dropping it was intentional for some specific reason, but I'm not aware of one; it looks like a simple regression instead.

Related bugs are MB-7887 and MB-9930.
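
As a quick check, whether a given build carries the define can be confirmed by looking for the small-memory-model note in the detailed tcmalloc stats quoted above (where exactly those stats end up being dumped depends on the test harness, so the file name below is a placeholder):

    # builds compiled with -DTCMALLOC_SMALL_BUT_SLOW print this note in the
    # detailed allocator stats; its absence suggests the define was dropped
    grep "SMALL MEMORY MODEL IS IN USE" <file-containing-the-detailed-stats>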

 Comments   
Comment by Phil Labee [ 07/Mar/14 ]
Abandoned change: http://review.couchbase.org/#/c/34270/

Instead, add this to the make command:

   libtcmalloc_EXTRA_OPTIONS="-DTCMALLOC_SMALL_BUT_SLOW"

This change is made in membase / buildbot-internal / Buildbot / master.cfg:

https://github.com/membase/buildbot-internal/commit/6d7da38047eaf8ba9fb89aa0544c3cdc8697f53b
Comment by Phil Labee [ 11/Mar/14 ]
voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gprertools configure command
Comment by Phil Labee [ 11/Mar/14 ]
fixed in build: 2.5.1-1074
Comment by Maria McDuff (Inactive) [ 11/Mar/14 ]
Phil,

is this the build ready for QE to test, 2.5.1-1074?
Comment by Maria McDuff (Inactive) [ 11/Mar/14 ]
Andrei,

can you confirm that this message is no longer appearing in this new build, 1074? Thanks.
On all OSes:

2014-02-11 03:57:31 | INFO | MainProcess | MainThread | [remote_util.log_command_output] detailed: NOTE: SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER.

- Windows 32-bit, 64-bit
- CentOS 32-bit, 64-bit
- Ubuntu 32-bit, 64-bit
- Mac 64-bit - I can verify this one since you don't have a Mac
Comment by Andrei Baranouski [ 11/Mar/14 ]
Hi Maria,
Hi Maria,
I have a Mac as well, so I can check it out. I have already talked with Alk about what scenarios to use to test the build. I'll give the results tomorrow.
Comment by Maria McDuff (Inactive) [ 14/Mar/14 ]
FYI Andrei, a working tcmalloc build is still not available...
Comment by Wayne Siu [ 26/Mar/14 ]
There are some issues uncovered related to the fix.




[MB-11783] We need the administrator creds available in isasl.pw Created: 22/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Trond Norbye Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We'd like to add authentication for some of the operations (like setting configuration tunables dynamically). Instead of telling the user to go look on the system for isasl.pw, dig out the _admin entry, and then use that together with the generated password, it would be nice if the credentials defined when setting up the cluster could be used.
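
For reference, a minimal sketch of the manual step this request wants to avoid (the file path and the one-entry-per-line format are assumptions based on a default Linux install):

    # dig out the generated _admin entry from isasl.pw (path assumed)
    grep '^_admin' /opt/couchbase/var/lib/couchbase/isasl.pw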

 Comments   
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Interesting. Very much :)

I cannot do it because we don't have administrator creds anymore. We just have some kind of password hash and that's it.

I admit my fault; I could have been more forward-looking. But it was somewhat guided by your response back in the day, which I interpreted as reluctance to allow memcached auth via admin credentials.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And of course all admin ops to memcached can still be handled by ns_server, safely and in a more controlled way (globally or locally, as needed).
Comment by Dave Rigby [ 30/Jul/14 ]
Alk: To give you some flavour of why Trond asked for this: we've added a "get out of jail" card for tcmalloc problems - new binary protocol messages (IOCTL_GET / IOCTL_SET) to release free memory and a command-line util (mcctl) to trigger it - see MB-11772. This is a memcached-wide operation, and hence it doesn't really make sense for it to have bucket-level auth; it should probably have admin-level auth instead.

For this feature (MB-11772) it's not a big problem to not have auth on it (as it should be harmless), but we were also thinking ahead to things like changing the max connections on the fly (using the same binary protocol commands) and this *does* need auth.

Therefore I think we will need one of:

    1) The raw password for memcached to authenticate against.
    2) ns_server (GUI) support to change max_connections (using the new IOCTL_SET command)
    3) A method to check the hashed password from memcached (so it can authenticate the user)

It sounds like (1) isn't (currently) possible, so either one of (2) or (3) would be needed to support this use-case.

For maintenance-level operations like these it would be helpful to be able to trigger things like "release_free_memory" on a per-node basis, so the ability to selectively issue an IOCTL_SET to specific node(s) (and not just, say, globally from the GUI) does have value. It also means that we wouldn't necessarily have to spend the time and effort adding UI support for every "diagnostic" feature. So having (3) - even if the GUI also provides access for selective IOCTL_SET, such as max_conns - would be very helpful.


Comment by Aleksey Kondratenko [ 30/Jul/14 ]
Max connections? How is that related to memory?

I don't know how permanent you want this fix to be. If it's not meant to be permanent, you can have your Python script use the admin password against /diag/eval to fetch the node's memcached password, which can then be used to auth against memcached.

If it's a more permanent feature, then I believe a mere per-node script is not enough and we'll need something more principled implemented through ns_server (e.g. like bucket flush).

Also, if it's related to memory, have you seen this? https://code.google.com/p/gperftools/source/detail?r=a92fc76f72318f7a46e91d9ef6dd24f2bcf44802
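
A minimal sketch of the /diag/eval approach suggested above; the endpoint and the use of the admin credentials are from the comment, but the Erlang expression is only a hypothetical placeholder and will likely differ between ns_server versions:

    # fetch the node's memcached password via /diag/eval (expression is a guess)
    curl -u Administrator:password -X POST http://127.0.0.1:8091/diag/eval \
         -d 'ns_config:search_node_prop(ns_config:latest(), memcached, admin_pass).'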




[MB-11772] Provide the facility to release free memory back to the OS from running mcd process Created: 21/Jul/14  Updated: 30/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
On many occasions we have seen tcmalloc being very "greedy" with free memory and not releasing it back to the OS very quickly. There have even been occasions where this has triggered the Linux OOM-killer due to the memcached process having too much "free" tcmalloc memory still resident.

tcmalloc by design will /slowly/ return memory back to the OS - via madvise(DONT_NEED) - but this rate is very conservative, and it can only be changed currently by modifying an environment variable, which obviously cannot be done on a running process.

To help mitigate these problems in future, it would be very helpful to allow the user to request that free memory is released back to the OS.
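
For context, the environment variable referred to above is gperftools' TCMALLOC_RELEASE_RATE; a minimal sketch of tuning it, which only takes effect at process start:

    # higher values make tcmalloc return freed pages to the OS more aggressively;
    # it is read once at startup, hence the need for a runtime mechanism
    export TCMALLOC_RELEASE_RATE=10
    # ...then restart the memcached process for it to take effect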


 Comments   
Comment by Dave Rigby [ 21/Jul/14 ]
http://review.couchbase.org/#/c/39608/
Comment by Aleksey Kondratenko [ 30/Jul/14 ]
tcmalloc madvise settings can be changed on a running process. And there's also https://code.google.com/p/gperftools/source/detail?r=a92fc76f72318f7a46e91d9ef6dd24f2bcf44802 as of gperftools 2.2




[MB-11799] Bucket compaction causes massive slowness of flusher and UPR consumers Created: 23/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_b1-vs-compaction_b2-vs-ep_upr_replica_items_remaining-vs_xdcr_lag.png    
Issue Links:
Duplicate
is duplicated by MB-11731 Persistence to disk suffers from buck... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/386/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Similar to MB-11731, which is getting worse and worse. But now compaction affects intra-cluster replication and XDCR latency as well:

"ep_upr_replica_items_remaining" reaches 1M during compaction
"xdcr latency" reaches 5 minutes during compaction.

See attached charts for details. Full reports:

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1005_a66_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1005_6d2_access

One important change that we made recently - http://review.couchbase.org/#/c/39647/.

The last known working build is 3.0.0-988.

 Comments   
Comment by Pavel Paulau [ 23/Jul/14 ]
Chiyoung,

This is a really critical regression. It affects many XDCR tests and also blocks many investigation/tuning efforts.
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix added for review at http://review.couchbase.org/39880. Thanks.
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue:

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.
Comment by Pavel Paulau [ 26/Jul/14 ]
Toy build helps a lot.

It doesn't fix the problem but it at least minimizes the regression:
-- ep_upr_replica_items_remaining is close to zero now
-- write queue is 10x lower
-- max xdcr latency is about 8-9 seconds

Logs: http://ci.sc.couchbase.com/view/lab/job/perf-dev/530/
Reports:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-785-toy_6ed_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-785-toy_269_access
Comment by Chiyoung Seo [ 26/Jul/14 ]
Thanks Pavel for the updates. We will merge the above changes soon.

Do you mean that both the disk write queue size and the XDCR latency are still regressions? Or is XDCR your only major concern?

As you pointed out above, the recent change parallelizing compaction (4 tasks by default) is most likely the main root cause of this issue. Do you still see the compaction slowness in your tests? I guess "no", because we can now run 4 concurrent compaction tasks on each node.

I will talk to Aliaksey to understand that change more.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Pavel,

I will continue to look at some more optimizations on the ep-engine side. In the meantime, you may want to test the toy build again after lowering compaction_number_of_kv_workers on the ns-server side from 4 to 1. As mentioned in http://review.couchbase.org/#/c/39647/ , that parameter is configurable on the ns-server side.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Btw, all the changes above were merged. You can use the new build and lower the above compaction parameter.
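
For reference, a minimal sketch of lowering that parameter on the ns-server side via /diag/eval (the config key name comes from the comments above; the exact Erlang call is an assumption and may differ):

    curl -u Administrator:password -X POST http://127.0.0.1:8091/diag/eval \
         -d 'ns_config:set(compaction_number_of_kv_workers, 1).'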
Comment by Pavel Paulau [ 28/Jul/14 ]
Build 3.0.0-1035 with compaction_number_of_kv_workers = 1:

http://ci.sc.couchbase.com/job/perf-dev/533/artifact/

Source: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1035_276_access
Destination: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1035_624_access

Disk write queue is lower (max ~5-10K) but xdcr latency is still high (several seconds) and affected by compaction.
Comment by Chiyoung Seo [ 30/Jul/14 ]
Pavel,

The following change is merged:

http://review.couchbase.org/#/c/40043/

I plan to make another change for this issue today, but you may want to test with the new build that includes the above fix.
Comment by Chiyoung Seo [ 30/Jul/14 ]
I just pushed another important change in gerrit for review:

http://review.couchbase.org/#/c/40059/




[MB-11780] [Upgrade tests Ubuntu-12.04] Before upgrade, after configuring XDCR on 2.0.0-1976, node become un-responsive Created: 22/Jul/14  Updated: 30/Jul/14  Resolved: 30/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Version: 2.0.0-1976-rel
Ubuntu 12.04

Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11780/2b77fbd0/10.3.3.218-diag.txt.gz
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11780/f4673f14/10.3.3.218-7192014-753-diag.zip
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11780/066eb643/10.3.3.240-diag.txt.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11780/d169dc68/10.3.3.240-7192014-753-diag.zip

[Destination]
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11780/481a8e90/10.3.3.225-7192014-754-diag.zip
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11780/566270c9/10.3.3.225-diag.txt.gz
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11780/52202c4f/10.3.3.239-7192014-754-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11780/7107d70c/10.3.3.239-diag.txt.gz

Issue occurred on 10.3.3.239, which was destination master.
Is this a Regression?: Unknown

 Description   
1. The test failed before the upgrade only.
2. Installed source and destination nodes with 2.0.0-1976-rel.
3. Changed the global XDCR settings xdcrFailureRestartInterval=1 and xdcrCheckpointInterval=60 on each cluster (see the sketch after this list).
4. Created remote clusters cluster0 and cluster1 for bi-directional XDCR.
5. Node 10.3.3.239 (the destination master) became unresponsive.
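
For reference, a minimal sketch of the settings change in step 3 (assuming the /internalSettings REST endpoint that the test's rest_client uses):

    curl -u Administrator:password -X POST http://10.3.3.240:8091/internalSettings \
         -d xdcrFailureRestartInterval=1 -d xdcrCheckpointInterval=60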

[Notes]
1. The test worked fine on CentOS.
2. 3 tests failed because of this issue; all are related to the 2.0.0-1976-rel upgrade to 3.0.

[Jenkins]
http://qa.hq.northscale.net/job/ubuntu_x64--36_01--XDCR_upgrade-P1/24/consoleFull

[Test]
./testrunner -i ubuntu_x64--36_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.offline_cluster_upgrade,initial_version=2.0.0-1976-rel,sdata=False,bucket_topology=default:1>2;bucket0:1><2,upgrade_nodes=dest;src,use_encryption_after_upgrade=src;dest


[Failure]
2014-07-19 07:50:28,182] - [basetestcase:264] INFO - sleep for 30 secs. ...
[2014-07-19 07:50:58,191] - [xdcrbasetests:1089] INFO - Setting xdcrFailureRestartInterval to 1 ..
[2014-07-19 07:50:58,204] - [rest_client:1726] INFO - Update internal setting xdcrFailureRestartInterval=1
[2014-07-19 07:50:58,262] - [rest_client:1726] INFO - Update internal setting xdcrFailureRestartInterval=1
[2014-07-19 07:50:58,263] - [xdcrbasetests:1089] INFO - Setting xdcrCheckpointInterval to 60 ..
[2014-07-19 07:50:58,278] - [rest_client:1726] INFO - Update internal setting xdcrCheckpointInterval=60
[2014-07-19 07:50:58,382] - [rest_client:1726] INFO - Update internal setting xdcrCheckpointInterval=60
[2014-07-19 07:50:58,392] - [rest_client:828] INFO - adding remote cluster hostname:10.3.3.239:8091 with username:password Administrator:password name:cluster1 to source node: 10.3.3.240:8091
[2014-07-19 07:50:58,780] - [rest_client:828] INFO - adding remote cluster hostname:10.3.3.240:8091 with username:password Administrator:password name:cluster0 to source node: 10.3.3.239:8091
[2014-07-19 07:50:59,048] - [rest_client:874] INFO - starting continuous replication type:capi from default to default in the remote cluster cluster1
[2014-07-19 07:50:59,250] - [basetestcase:264] INFO - sleep for 5 secs. ...
[2014-07-19 07:51:04,256] - [rest_client:874] INFO - starting continuous replication type:capi from bucket0 to bucket0 in the remote cluster cluster1
[2014-07-19 07:51:04,559] - [basetestcase:264] INFO - sleep for 5 secs. ...
[2014-07-19 07:51:15,538] - [rest_client:747] ERROR - http://10.3.3.239:8091/nodes/self error 500 reason: unknown ["Unexpected server error, request logged."]
http://10.3.3.239:8091/nodes/self with status False: [u'Unexpected server error, request logged.']
[2014-07-19 07:51:15,538] - [xdcrbasetests:139] ERROR - list indices must be integers, not str
[2014-07-19 07:51:15,539] - [xdcrbasetests:140] ERROR - Error while setting up clusters: (<type 'exceptions.TypeError'>, TypeError('list indices must be integers, not str',), <traceback object at 0x31dd7a0>)
[2014-07-19 07:51:15,540] - [xdcrbasetests:179] INFO - ============== XDCRbasetests cleanup is started for test #11 offline_cluster_upgrade ==============


 Comments   
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Logs from 10.3.3.239 at that duration:

[ns_server:debug,2014-07-19T7:50:40.998,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 0 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.137,ns_1@10.3.3.239:<0.995.0>:mc_connection:do_notify_vbucket_update:112]Signaled mc_couch_event: {set_vbucket,"bucket0",843,active,0}
[ns_server:debug,2014-07-19T7:50:41.142,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 1 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.143,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 2 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.139,ns_1@10.3.3.239:capi_set_view_manager-bucket0<0.979.0>:capi_set_view_manager:handle_info:377]Usable vbuckets:
[997,933,869,984,920,856,971,907,843,958,894,1022,945,881,1009,996,964,932,
 900,868,983,951,919,887,855,1015,970,938,906,874,1002,989,957,925,893,861,
 1021,976,944,912,880,848,1008,995,963,931,899,867,982,950,918,886,854,1014,
 969,937,905,873,1001,988,956,924,892,860,1020,975,943,911,879,847,1007,994,
 962,930,898,866,981,949,917,885,853,1013,968,936,904,872,1000,987,955,923,
 891,859,1019,974,942,910,878,846,1006,993,961,929,897,865,980,948,916,884,
 852,1012,999,967,935,903,871,986,954,922,890,858,1018,973,941,909,877,845,
 1005,992,960,928,896,864,979,947,915,883,851,1011,998,966,934,902,870,985,
 953,921,889,857,1017,972,940,908,876,844,1004,991,959,927,895,863,1023,978,
 946,914,882,850,1010,965,901,952,888,1016,939,875,1003,990,926,862,977,913,
 849]
[ns_server:debug,2014-07-19T7:50:41.150,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 3 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.158,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 4 ({not_found,no_db_file}). Ignoring
[views:debug,2014-07-19T7:50:41.246,ns_1@10.3.3.239:mc_couch_events<0.428.0>:capi_set_view_manager:handle_mc_couch_event:529]Got set_vbucket event for bucket0/842. Updated state: active (0)
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
The problem appeared on the Ubuntu platform only.
Comment by Aleksey Kondratenko [ 30/Jul/14 ]
If I understand correctly, it's completely unrelated to the upcoming release. Any issues in the old release are of historic interest now.




[MB-11237] DCP stats should report #errors under the web console Created: 28/May/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We should consider adding an errors/sec counter to the stats category under the web console. The counter would report the number of errors being seen through the DCP protocol, to indicate issues with communication among nodes.

 Comments   
Comment by Anil Kumar [ 10/Jun/14 ]
Triage - June 10 2014 Anil
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical"; this needs to be fixed by RC.




[MB-11733] One node is slow during indexing Created: 15/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Volker Mische Assignee: Volker Mische
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I don't know whether this is an environmental problem or not. On Pavel's performance run with 4 nodes, one node is slow, one is halfway slow, and two are normal. You can find the logs of the slow run here [1].

If you look at the current ShowFast graph [2] of the "Initial index (min), 1 bucket x 50M x 2KB, DGM, 4 x 1 views, no mutations" run ("Linux", "View Indexing" -> "Initial", second graph), it's way slower in build 956 than in build 928 (46.1s vs. 22.6s). When looking at the logs, it's node *.31 that's way slower. It is either ep-engine not providing the UPR stream messages fast enough, or the view-engine consuming them slowly.

This node has been shown to be slow in several tests, so it might even be a problem in the environment (like a slow disk).

Here's the analysis from the 4 nodes, where you can see that one is clearly way slower. The numbers on the right are the seconds between the "Backfill complete" and "Stream closing" message, the left number is how often it occurred:

cat cbcollect_info_ns_1@172.23.100.31_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}' > /tmp/31
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.29_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
    301 2
    208 3
      1 4
      1 5
      1 8
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.30_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
    169 2
     87 3
     16 4
     82 5
    119 6
     28 7
      9 8
      2 9
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.31_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
      9 5
     41 6
    146 7
    124 8
     76 9
     67 10
     29 11
     15 12
      3 13
      1 14
      1 16
vmx@emil$ cat cbcollect_info_ns_1@172.23.100.32_20140714-125849/memcached.log|grep 'Backfill complete\|Stream closing'|grep '_design/A'|cut -d ' ' -f 4|xargs -I {} date --date={} +'%s'|awk '{p=$1; getline; print $1-p}'|sort -n|uniq -c
    317 2
    195 3

[1] http://localhost:3000/job/leto/298/
[2] http://showfast.sc.couchbase.com/#/timeline

 Comments   
Comment by Pavel Paulau [ 15/Jul/14 ]
MB-9822 ?
Comment by Volker Mische [ 15/Jul/14 ]
Forgot a link to the report: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-956_bba_build_init_index
Comment by Volker Mische [ 15/Jul/14 ]
Pavel, yes, this sounds a lot like MB-9822, though I wonder whether it is really an Erlang VM problem. That's what I'm trying to find out. I guess I need to add some additional logging to see whether the view-engine doesn't receive the items as fast as possible, or doesn't process them as fast as the other servers.
Comment by Volker Mische [ 15/Jul/14 ]
I can't find the UPR flow control spec, where is it (I forgot the details)?
Comment by Sriram Melkote [ 15/Jul/14 ]
https://docs.google.com/document/d/1xm43fPU0pO3EkN5xlePqBLiWy7O7f1MGz8TUHdaEZlU/edit?pli=1

Wish it was in github like other docs.
Comment by Mike Wiederhold [ 15/Jul/14 ]
Once the backfill is complete, all of the items from the backfill are in memory. This means that the slowness you are reporting is for items that are already in memory. I would recommend checking the flow control logic and also looking for view-engine slowness. If you suspect that the slowness is caused by ep-engine, it would be good to get some information showing that messages sent per second are low or that there are large gaps in time between messages being sent.
Comment by Volker Mische [ 16/Jul/14 ]
After looking at the graphs from older builds, it really seems not to be specific to a single physical machine.

The next step is for the view-engine to get a stat which tracks how full the flow control buffer is. We hope this will give us some insights.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

This issue is from before but is happening more often now. It's a performance issue; we should continue looking into it.
Comment by Sriram Melkote [ 24/Jul/14 ]
Due to the impact of this on initial indexing, I'm raising this to a blocker
Comment by Volker Mische [ 30/Jul/14 ]
I found the issue: it is the CPU governor. On the slow node the cpufreq kernel module wasn't loaded, on the others it was. If I set the CPU governor on the slow node to "performance" (it is currently still in that state), then the indexing is even faster than on the other nodes. It can e.g. be seen here [1] on the graph named [172.23.100.31] beam.smp_rss: the usage drops earlier than on the other nodes.

The other nodes do some CPU scaling, I saw different values when I did a `cat /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq`.

[1]: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1057_202_build_init_index
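
For reference, a minimal sketch of pinning every core to the "performance" governor via the cpufreq sysfs interface mentioned above (assumes the cpufreq module is loaded on each node):

    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance | sudo tee "$g"
    done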
Comment by Volker Mische [ 30/Jul/14 ]
Pavel, I'm assigning the issue to you. You can either keep using this issue or create a new one.

I propose to set all nodes to CPU governor "performance". I leave the configuration of the machines to you. If you need any help, let me know.

Once there's a run with an updated system and its results on ShowFast I'll close the issue.
Comment by Pavel Paulau [ 30/Jul/14 ]
Hi Volker,

+ your comment from email thread:
"Somehow now 172.23.100.30 is slow. The only way I could imagine that happened is that I started "cpuspeed", perhaps it changed some setting. Though I'm pretty sure the issue will go away, once everything is setup properly in regards to the cpu governors."

172.23.100.31 did demonstrate weird characteristics sometimes. I'm 100% sure that there is/was an environment issue.

But CPU governors don't explain the slowness of other servers. They don't explain cases when 172.23.100.31 was using a lot of CPU resources.

I have nothing to do on this ticket. Regular indexing tests will be executed regardless of this issue. It's up to you to follow the results and close the ticket.
Comment by Volker Mische [ 30/Jul/14 ]
Pavel, can I take the environment again? I'm sure when I set the governor to "performance" on node 172.23.100.30 manually, it will just be fast again.

What I'm asking for is that *all* nodes use the same governor, so that they perform the same way ("performance" is obviously preferred to get the best numbers).
Comment by Volker Mische [ 30/Jul/14 ]
I just read your comment again. Does it mean that I should open a new ticket that says that all nodes should use the same governor?
Comment by Volker Mische [ 30/Jul/14 ]
I should concentrate a bit more. In the issue I'm describing here, node 172.23.100.31 was never using a lot of CPU; it was always using less CPU than the other nodes.
Comment by Pavel Paulau [ 30/Jul/14 ]
I just disabled "cpuspeed" on all machines and started a set of regular tests.
I don't understand the need for CPU scaling on production-like servers.

+example of beam.smp on 172.23.100.31 with ~1700% CPU utilization: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1045_389_access
Comment by Volker Mische [ 30/Jul/14 ]
I fully agree that there's no need for CPU scaling, that's exactly what I was after. Thanks.

+in this example all nodes take about the same amount of CPU.
Comment by Pavel Paulau [ 30/Jul/14 ]
@Volker,

With disabled cpuspeed on all machines:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1059_66d_build_init_index
Comment by Volker Mische [ 30/Jul/14 ]
@Pavel, I just saw it. Please let me know whenever the cluster is free again for me to experiment with.

Now another node is slow :(




[MB-11853] OS X app doesn't work if run from non-administrator user account Created: 30/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Jens Alfke Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
A user on the mobile mailing list reports that the Mac OS Couchbase Server app doesn't work when it's run from a non-administrator OS user account:

"It doesn't start from my non-admin account - message is displayed "You can't open the application "%@" because it may be damaged or incomplete." Yet Couchbase starts successfully from if I start it from the command line using an administrator account. In addition not all the menu options under the Couchbase icon work, e.g. Open Admin Console or About Couchbase Server."

https://groups.google.com/d/msgid/mobile-couchbase/5f53905e-bad1-4e51-b146-9d99f774506b%40googlegroups.com?utm_medium=email&utm_source=footer

It sounds as though some of the files in the bundle may have admin-only permissions?
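
As a quick check of that theory, one could list files in the app bundle that are not world-readable (the bundle path assumes the default install location):

    # files lacking the world-read bit would be unreadable from a non-admin account
    find "/Applications/Couchbase Server.app" ! -perm -004 -ls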




[MB-11675] 40-50% performance degradation on append-heavy workload compared to 2.5.1 Created: 09/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: OS X Mavericks 10.9.3
CB server 3.0.0-918 (http://packages.northscale.com/latestbuilds/3.0.0/couchbase-server-enterprise_x86_64_3.0.0-918-rel.zip)
Haswell MacBook Pro (16GB RAM)

Attachments: PNG File CB 2.5.1 revAB_sim.png     PNG File CB 3.0.0-918 revAB_sim.png     JPEG File ep.251.jpg     JPEG File ep.300.jpg     JPEG File epso.251.jpg     JPEG File epso.300.jpg     Zip Archive MB-11675.trace.zip     Zip Archive perf_report_result.zip     Zip Archive revAB_sim_v2.zip     Zip Archive revAB_sim.zip    
Issue Links:
Relates to
relates to MB-11642 Intra-replication falling far behind ... In Progress
relates to MB-11623 test for performance regressions with... In Progress

 Description   
When running an append-heavy workload (modelling a social network address book, see below) the performance of CB has dropped from ~100K ops down to 50K ops compared to 2.5.1-1083 on OS X.

Edit: I see a similar (but slightly smaller, around 40%) degradation on Linux (Ubuntu 14.04) - see comment below for details.

== Workload ==

revAB_sim - generates a model social network, then builds a representation of this in Couchbase. Keys are a set of phone numbers, values are lists of phone books which contain that phone number. (See attachment).

Configured for 8 client threads, 100,000 people (documents).

To run:

* pip install networkx
* Check revAB_sim.py for correct host, port, etc
* time ./revAB_sim.py

== Cluster ==

1 node, default bucket set to 1024MB quota.

== Runtimes for workload to complete ==


## CB-2.5.1-1083:

~107K op/s. Timings for workload (3 samples):

real 2m28.536s
real 2m28.820s
real 2m31.586s


## CB-3.0.0-918

~54K op/s. Timings for workload:

real 5m23.728s
real 5m22.129s
real 5m24.947s


 Comments   
Comment by Pavel Paulau [ 09/Jul/14 ]
I'm just curious, what does consume all CPU resources?
Comment by Dave Rigby [ 09/Jul/14 ]
I haven't had a chance to profile it yet; certainly in both instances (fast / slow) the CPU is at 100% between the client workload and the server.
Comment by Pavel Paulau [ 09/Jul/14 ]
Is memcached the top consumer? Or beam.smp? Or the client?
Comment by Dave Rigby [ 09/Jul/14 ]
memcached highest (as expected). From the 3.0.0 package (which I still have installed):

PID COMMAND %CPU TIME #TH #WQ #PORT #MREG MEM RPRVT PURG CMPRS VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW
34046 memcached 476.9 01:34.84 17/7 0 36 419 278M+ 277M+ 0B 0B 348M 2742M 34046 33801 running 501 73397+ 160 67 26 13304643+ 879+ 4070244+
34326 Python 93.4 00:18.57 9/1 0 25 418 293M+ 293M+ 0B 0B 386M 2755M 34326 1366 running 501 77745+ 399 70 28 15441263+ 629 5754198+
0 kernel_task 71.8 00:14.29 95/9 0 2 949 1174M+ 30M 0B 0B 295M 15G 0 0 running 0 42409 0 57335763+ 52435352+ 0 0 278127194+
...
32800 beam.smp 8.5 00:05.61 30/4 0 49 330 155M- 152M- 0B 0B 345M- 2748M- 32800 32793 running 501 255057+ 468 149 30 6824071+ 1862753+ 1623911+


Python is the workload generator.

I shall try to collect an Instruments profile of 3.0 and 2.5.1 to compare...
Comment by Dave Rigby [ 09/Jul/14 ]
Instruments profile of two runs:

Run 1: 3.0.0 (slow)
Run 2: 2.5.1 (fast)

I can look into the differences tomorrow if no-one else gets there first.


Comment by Dave Rigby [ 10/Jul/14 ]
Running on Linux (Ubuntu 14.04), 24 core Xeon, I see a similar effect, but the magnitude is not as bad - 40% performance drop.

100,000 documents with 4 worker threads, same bucket size (1024MB). (Note: worker threads were dropped to 4 as I couldn't get the Python SDK to reliably connect with 8 threads at the same time.)

## CB-3.0.0 (source build):

    83k op/s
    real 3m26.785s

## CB-2.5.1 (source build):

    133K op/s
    real 2m4.276s


Edit: Attached updated zip file as: revAB_sim_v2.zip
Comment by Dave Rigby [ 10/Jul/14 ]
Attaching the output of `perf report` for both 2.5.1 and 3.0.0 - perf_report_result.zip

There's nothing obvious jumping out at me; it looks like quite a bit has changed between the two in ep_engine.
Comment by Dave Rigby [ 11/Jul/14 ]
I'm tempted to bump this to "blocker" considering it also affects Linux - any thoughts?
Comment by Pavel Paulau [ 11/Jul/14 ]
It's a product/release blocker, no doubt.

(though raising priority at this point will not move ticket to the top of the backlog due to other issues)
Comment by Dave Rigby [ 11/Jul/14 ]
@Pavel done :)
Comment by Abhinav Dangeti [ 11/Jul/14 ]
I think I should bring to people's notice that in 3.0, JSON detection has been moved to before items are set in memory. This could very well be the cause of this regression (previously we did do this JSON check, but just before persistence).
This was part of the datatype-related change, now required by UPR.
A HELLO protocol was newly introduced in 3.0, which clients can invoke, thereby letting the server know that they will set the datatype themselves, in which case this JSON check wouldn't take place.
If a client doesn't invoke the HELLO command, then we do JSON detection to set the datatype correctly.

However, HELLO was recently disabled as we weren't ready to handle compressed documents in the view engine. This means we do a mandatory JSON check for every store operation, before the document is even set in memory.
Comment by Cihan Biyikoglu [ 11/Jul/14 ]
Thanks Abhinav. Can we quickly try out whether this resolves the issue and, if proven, revert this change?
Comment by David Liao [ 14/Jul/14 ]
I tried testing using the provided scripts with and without the JSON-checking logic and there is no difference (on Mac and Ubuntu).

The total size of data is less than 200 MB with 100K items; that's about <2 KB per item, which is not very big.
Comment by David Liao [ 15/Jul/14 ]
There might be an issue with general disk operations. I tested sets and they show the same performance difference as appends.
Pavel, have you seen any 'set' performance drop with 3.0? There is no rebalance involved, just a single node in this test.
Comment by Pavel Paulau [ 16/Jul/14 ]
3.0 performs worse in CPU-bound scenarios.
However, Dave observed the same issue on a system with 24 vCPUs, which is kind of confusing to me.
Comment by Pavel Paulau [ 16/Jul/14 ]
Meanwhile I tried that script in my environment. I see no difference between 2.5.1 and 3.0.

3.0.0-969: real 3m30.530s
2.5.1-1083: real 3m28.911s

Peak throughput is about 80K in both cases.

h/w configuration:

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

I used a standalone server as test client and regular packages.
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: I was essentially maxing out the system, so that probably explains why even with 24 cores I could see the issue.
Comment by Pavel Paulau [ 16/Jul/14 ]
Does that mean that 83/133K ops/sec saturate a system with 24 cores?
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: yes (including the client workload generator which was running on the same machine). I could possibly push it higher by increasing the client worker threads, but as mentioned I had some python SDK connection issues then.

Comment by Pavel Paulau [ 16/Jul/14 ]
Weird, in my case CPU utilization was less than 500% (IRIX mode).
Comment by David Liao [ 16/Jul/14 ]
I am using a 4-core/4 GB Ubuntu VM for the test.

3.0
real 11m16.530s
user 2m33.814s
sys 2m35.779s
<30k ops

2.5.1
real 7m6.843s
user 2m6.693s
sys 2m2.696s
40k ops


During today's test, I found out that the disk queue fill/drain rate of 3.0.0 is much smaller than 2.5.1's (<2k vs 30k). The CPU usage is ~8% higher too, but most of the increase is from system CPU usage (total CPU is almost maxed out on 3.0).

Pavel, can you check the disk queue fill/drain rate of your test and system vs user cpu usage?
Comment by Pavel Paulau [ 16/Jul/14 ]
David,

I will check disk stats tomorrow. In the meantime I would recommend you run the benchmark with persistence disabled.
Comment by Pavel Paulau [ 18/Jul/14 ]
In my case the drain rate is higher in 2.5.1 (80K vs. 5K), but the size of the write queue and the rate of actual disk creates/updates are pretty much the same.

CPU utilization is 2x higher in 3.0 (400% vs. 200%).

However I don't understand how this information helps.
Comment by David Liao [ 21/Jul/14 ]
The drain rate may not be accurate on 2.5.1.
 
'iostat' shows about 2x 'tps' and 'KB_wrtn/s' for 3.0.0 vs. 2.5.1, so it indicates far more disk activity in 3.0.0.

We need to find out what the extra disk activity is. Since ep-engine issues "set"s to couchstore, which then writes to disk, we should benchmark couchstore separately to isolate the problem.

Pavel, is there a way to do a couchstore performance test?
Comment by Pavel Paulau [ 22/Jul/14 ]
Due to the increased number of flusher threads, 3.0.0 persists data faster; that must explain the higher disk activity.

Once again, disabling disk persistence entirely will eliminate the "disk" factor (just as an experiment).

Also, I don't think we made any critical changes in couchstore, so I don't expect any regression there. Chiyoung may have some benchmarks.
Comment by David Liao [ 22/Jul/14 ]
I have played with different flusher thread counts but don't see any improvement in my own not-designed-for-serious-performance-testing environment.

Logically, if the flusher threads run faster, the total transfer to disk should finish in a shorter time. My observation is that the higher TPS lasted for the entire testing period, which itself is much longer than on 2.5.1, which means the total disk activity (TPS and data written to disk) for the same amount of workload is much higher.

Do you mean using a memcached bucket when you say "disabling disk"? That test shows much less performance degradation, which means the majority of the problem is not from the memcached layer.

I am not familiar with the couchstore changes, but there are indeed quite a lot of them and I'm not sure who is responsible for that component. Still, it needs to be tested just like any other component.
Comment by Pavel Paulau [ 23/Jul/14 ]
I meant disabling persistence to disk in the couchbase bucket, e.g. using cbepctl.
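
For reference, a minimal sketch of disabling persistence with cbepctl as suggested (the bucket option and exact syntax may vary by version; "start" re-enables it):

    /opt/couchbase/bin/cbepctl localhost:11210 stop -b default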
Comment by David Liao [ 23/Jul/14 ]
I disabled persistence with cbepctl and reran the tests and got the same performance degradation:

3.0.0:
real 6m3.988s
user 1m59.670s
sys 2m1.363s
ops: 50k

2.5.1
real 4m18.072s
user 1m45.940s
sys 1m39.775s
ops: 70k

So it's not the disk related operations that caused this.
Comment by David Liao [ 24/Jul/14 ]
Dave, what profiling tool did you use to collect the profiling data you attached?
Comment by Dave Rigby [ 24/Jul/14 ]
I used Linux perf - see for example http://www.brendangregg.com/perf.html
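
For reference, a minimal sketch of the kind of profile collected here with Linux perf (the sampling duration and process selection are placeholders):

    # sample call stacks of the running memcached for 60 seconds, then report
    perf record -g -p $(pgrep -x memcached) -- sleep 60
    perf report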
Comment by David Liao [ 25/Jul/14 ]
Attached perf report for ep.so 3.0.0.
Comment by David Liao [ 25/Jul/14 ]
Attached perf report for ep.so 2.5.1.
Comment by David Liao [ 25/Jul/14 ]
I attached memcached and ep.so CPU usage for both 3.0.0 and 2.5.1.

2.5.1 didn't use C++ atomics. I tested 3.0.0 without C++ atomics and saw the following improvement: ~20% difference.

Both with persistence disabled.

2.5.1
real 7m38.581s
user 2m11.771s
sys 2m27.968s
ops: 35k+

3.0.0
real 9m15.638s
user 2m31.642s
sys 2m56.154s
ops: ~30k

There could be multiple things that we still need to look at: the threading change in 3.0.0 (and thus figuring out the best number of threads for different workloads), and also why much more data is being written to disk in this workload.

I am using my laptop for the perf testing, but this kind of test should be done in a dedicated/controlled testing environment.
So the perf team should try to test the following areas:
1. The C++ atomics change.
2. Different threading configurations for different types of workload.
3. Independent couchstore testing, decoupled from ep-engine.

Comment by Pavel Paulau [ 26/Jul/14 ]
As I mentioned before, I don't see a difference between 2.5.1 and 3.0.0 using a dedicated/controlled testing environment.

Anyways, thanks for your deep investigation. I will try to reproduce the issue on my laptop.

cc Thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
In both cases (append-heavy workloads where sets are > 50K ops) the performance degradation was seen in early 3.0.0 builds. Collateral symptoms were: 1) an increase in bytes written to store/disk of approximately 20%; 2) increased frequency of bucket compression (the log shows the bucket ranges being compressed overlap); 3) a drop-off of OPS over time.
Starting with build 3.0.0-1037, these performance metrics are generally aligned with/equivalent to 2.5.1: 1) the frequency of bucket compression is reduced; 2) the expansion of bytes written is reduced to almost 1-to-1; 3) the OPS contention/slowdown does not occur.

The test is 10 concurrent loaders with 1024-byte documents (JSON or not-JSON), averaging ~80K OPS.
Comment by Dave Rigby [ 30/Jul/14 ]
TL;DR: 3.0 release debs appear to be built *without* optimisation (!)

On a hunch I thought I'd see how we are building 3.0.0, as it seemed a little surprising we saw symbols for C++ atomics as I would have expected them to be inlined. Looking at the build log [1], I see we are building the .deb package as Debug, without optimisation:

    (cd build && cmake -G "Unix Makefiles" -D CMAKE_INSTALL_PREFIX="/opt/couchbase" -D CMAKE_PREFIX_PATH=";/opt/couchbase" -D PRODUCT_VERSION=3.0.0-1059-rel -D BUILD_ENTERPRISE=TRUE -D CMAKE_BUILD_TYPE=Debug -D CB_DOWNLOAD_DEPS=1 ..)

Note: CMAKE_BUILD_TYPE=****Debug****

From my local Ubuntu build, I see that CXX flags are set to the following for each of Debug / Release / RelWithDebInfo:

    CMAKE_CXX_FLAGS_DEBUG:STRING=-g
    CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
    CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG

For comparison I checked the latest 2.5.1 build [2] (which may not be the same as the last 2.5.1 release) and I see we *did* compile that with -O3 - for example:

    libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -pipe -I./include -DHAVE_VISIBILITY=1 -fvisibility=hidden -I./src -I./include -I/opt/couchbase/include -pipe -O3 -O3 -ggdb3 -MT src/ep_la-ep_engine.lo -MD -MP -MF src/.deps/ep_la-ep_engine.Tpo -c src/ep_engine.cc -fPIC -DPIC -o src/.libs/ep_la-ep_engine.o


If someone from build / infrastructure could confirm that would be great, but all the evidence suggests we are building our release packages with no optimisation (!!)

I believe the solution here is to change the invocation of cmake to set CMAKE_BUILD_TYPE=Release.


[1]: http://builds.hq.northscale.net:8010/builders/ubuntu-1204-x64-300-builder/builds/1100/steps/couchbase-server%20make%20enterprise%20/logs/stdio
[2]: http://builds.hq.northscale.net:8010/builders/ubuntu-1204-x64-251-builder/builds/38/steps/couchbase-server%20make%20enterprise%20/logs/stdio
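
For reference, a minimal sketch of the suggested fix, i.e. the same package build invocation as quoted above with only the build type switched:

    (cd build && cmake -G "Unix Makefiles" -D CMAKE_INSTALL_PREFIX="/opt/couchbase" -D CMAKE_PREFIX_PATH=";/opt/couchbase" -D PRODUCT_VERSION=3.0.0-1059-rel -D BUILD_ENTERPRISE=TRUE -D CMAKE_BUILD_TYPE=Release -D CB_DOWNLOAD_DEPS=1 ..)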
Comment by Dave Rigby [ 30/Jul/14 ]
Just checked RHEL - I see the same.

3.0:

    (cd build && cmake -G "Unix Makefiles" <cut> -D CMAKE_BUILD_TYPE=Debug <cut>

    Full logs: http://builds.hq.northscale.net:8010/builders/centos-6-x64-300-builder/builds/1095/steps/couchbase-server%20make%20enterprise%20/logs/stdio


2.5.1:

    libtool: compile: g++ <cut> -O3 -c src/ep_engine.cc -o src/.libs/ep_la-ep_engine.o
    
    Full logs: http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/42/steps/couchbase-server%20make%20enterprise%20/logs/stdio






[MB-11840] 3.0 (Beta): Views periodically take 2 orders of magnitude longer to complete Created: 29/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0-Beta
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Daniel Owen Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: customer, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Single Node running version 3.0.0 enterprise edition (build-918). Running on VirtualBox, assigned 8 vCPUs and 8GB memory. (host has 24 cores, 128GB RAM).

Attachments: File backup.tgz     File curlscript.sh     PNG File output-251.png     PNG File output-3.0.png    
Issue Links:
Dependency

 Description   
Hi Alk,

I can demonstrate the behaviour of views periodically taking 2 orders of magnitude longer with 3.0.
(Similar to the issue we were investigating relating to Stats Archiver).

See output-3.0; the x-axis is just a count of view queries. The test ran for ~53 minutes and completed 315408 view queries (~100 per second). The y-axis is view response time (in seconds).

In general the response time is < 0.01 seconds. However, occasionally (9 out of 315408 views) it takes > 0.1 seconds. This may be considered acceptable in the design of the server, but I wanted to get confirmation.

To replicate the test, run...

 while true; do ./curlscript.sh >> times2.txt 2>&1 ; done

I have provided curlscript.sh as an attached file.

The generated workload is test data from the same customer that hit the Stats Archiver issue.
Create a bucket named "oogway" and then do a cbtransfer of the unpacked backup.tgz file (see attached).
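
For reference, a minimal sketch of loading the attached data (the backup path and credentials are placeholders; -B sets the destination bucket):

    # restore the unpacked backup into the "oogway" bucket
    /opt/couchbase/bin/cbtransfer /path/to/unpacked-backup http://localhost:8091 \
        -B oogway -u Administrator -p password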

 Comments   
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
What am I supposed to do with that?
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
CC-ed some folks.
Comment by Sriram Melkote [ 29/Jul/14 ]
Daniel - can you please let me know what is plotted on the X and Y axes, and the units for them?
Comment by Daniel Owen [ 29/Jul/14 ]
Hi Sriram, I have updated the description to contain more information. I'm just currently running a similar experiment on 2.5.1 and will upload when I get the results.
Comment by Daniel Owen [ 29/Jul/14 ]
I have uploaded data for a similar experiment performed on 2.5.1 (build-1083).
Again for ~53 minutes, we performed a total of 308193 queries (~100 per second), and 15 out of 308193 took > 0.1 seconds to complete. In general the response time is < 0.01 seconds.

Note that, given the large CPU entitlement, we don't see any regular peak in view times due to the Stats Archiver (i.e. we are not seeing regular spikes every 120 seconds); however, we are still seeing very large spikes in view query response times (and they appear more frequently than in the 3.0 beta).
Comment by Daniel Owen [ 29/Jul/14 ]
I suspect the 2.5.1 results are worse than 3.0 because 2.5.1 uses Erlang R14B04 and therefore, as highlighted by Dave Rigby, may be impacted by the bug OTP-11163.

See https://gist.github.com/chewbranca/07d9a6eed3da7b490b47#scheduler-collapse
Comment by Sriram Melkote [ 29/Jul/14 ]
A few points I'd like to note:

(a) There is no specified guarantee on the time a query will take to respond; 300 ms is not an unusual response time for the odd case.
(b) It appears not to be a regression, based on the 2.5 and 3.0 comparison graphs.
(c) The query layer is heavily in Erlang and we are already rewriting it, so I'm targeting this outside of 3.0.
(d) I have no reason to think OTP-11163 is involved, as there are many other reasons for the Erlang VM to pause.

I'm changing this back to a task as we need to investigate further to see if this behavior is indicative of an underlying bug before proceeding further.

+Ilam




[MB-10440] something isn't right with tcmalloc in build 1074 on at least rhel6 causing memcached to crash Created: 11/Mar/14  Updated: 30/Jul/14  Resolved: 30/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Phil Labee
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to
relates to MB-10371 tcmalloc must be compiled with -DTCMA... Open
relates to MB-10439 Upgrade:: 2.5.0-1059 to 2.5.1-1074 =>... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
SUBJ.

Just installing the latest 2.5.1 build on rhel6 and creating a bucket caused a segmentation fault (see also MB-10439).

When replacing tcmalloc with a copy I've built, it works.

I cannot be 100% sure it's tcmalloc, but the crash looks too easily reproducible to be something else.


 Comments   
Comment by Wayne Siu [ 12/Mar/14 ]
Phil,
Can you review whether this change (copied from MB-10371) has been applied properly?

voltron (2.5.1) commit: 73125ad66996d34e94f0f1e5892391a633c34d3f

    http://review.couchbase.org/#/c/34344/

passes "CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW" to each gprertools configure command
Comment by Andrei Baranouski [ 12/Mar/14 ]
I see the same issue on CentOS x64.
Comment by Phil Labee [ 12/Mar/14 ]
need more info:

1. What package did you install?

2. How did you build the tcmalloc which fixes the problem?
 
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
Build 1074, rhel6 package.

You can see for yourself. It's easily reproducible, as Andrei also confirmed.

I got the 2.1 tar.gz from googlecode, then did ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW', then make and make install. After that it works. I have no idea why.

Do you know the exact CFLAGS and CXXFLAGS that are used to build our tcmalloc? Those variables are likely set in voltron (or even outside of voltron) and might affect optimization and therefore expose some bugs.
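
Spelled out, the rebuild described above looks roughly like this (a sketch only; prefix and flags as stated, run with sufficient privileges to write under /opt/couchbase):

    # Rebuild gperftools 2.1 from the googlecode tarball and install over /opt/couchbase.
    tar xzf gperftools-2.1.tar.gz && cd gperftools-2.1
    ./configure --prefix=/opt/couchbase --enable-minimal CPPFLAGS='-DTCMALLOC_SMALL_BUT_SLOW'
    make && make install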

Comment by Aleksey Kondratenko [ 12/Mar/14 ]
And 64 bit.
Comment by Phil Labee [ 12/Mar/14 ]
We build out of:

    https://github.com/couchbase/gperftools

and for 2.5.1 use commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

compile using:

(cd /home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools \
&& ./autogen.sh \
        && ./configure --prefix=/opt/couchbase CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW --enable-minimal \
        && make \
        && make install-exec-am install-data-am)
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
That part I know. What I don't know is what cflags are being used.
Comment by Phil Labee [ 13/Mar/14 ]
from the 2.5.1 centos-6-x86 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x86-251-builder/builds/18/steps/couchbase-server%20make%20enterprise%20/logs/stdio

make[1]: Entering directory `/home/buildbot/buildbot_slave/centos-6-x86-251-builder/build/build/gperftools'

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -mmmx -fno-omit-frame-pointer -Wno-unused-result -march=i686 -mno-tls-direct-seg-refs -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Phil Labee [ 13/Mar/14 ]
from a 2.5.1 centos-6-x64 build log:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/16/steps/couchbase-server%20make%20enterprise%20/logs/stdio

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c -o libtcmalloc_minimal_la-tcmalloc.lo `test -f 'src/tcmalloc.cc' || echo './'`src/tcmalloc.cc

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -fPIC -DPIC -o .libs/libtcmalloc_minimal_la-tcmalloc.o

libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -I./src -DNO_TCMALLOC_SAMPLES -DTCMALLOC_SMALL_BUT_SLOW -DNO_TCMALLOC_SAMPLES -pthread -DNDEBUG -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -fno-builtin-malloc -fno-builtin-free -fno-builtin-realloc -fno-builtin-calloc -fno-builtin-cfree -fno-builtin-memalign -fno-builtin-posix_memalign -fno-builtin-valloc -fno-builtin-pvalloc -Wno-unused-result -DNO_FRAME_POINTER -O3 -ggdb3 -MT libtcmalloc_minimal_la-tcmalloc.lo -MD -MP -MF .deps/libtcmalloc_minimal_la-tcmalloc.Tpo -c src/tcmalloc.cc -o libtcmalloc_minimal_la-tcmalloc.o
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
OK. I'll try to rule out -O3 as a possible cause of the failure later today (in which case it might be an upstream bug). In the meantime I suggest you try lowering optimization to -O2, unless you have other ideas of course.
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Building tcmalloc with the exact same cflags (including -O3) doesn't cause any crashes. At this time my guess is either a compiler bug or cosmic radiation hitting just this specific build.

Can we simply force a rebuild?
Comment by Phil Labee [ 13/Mar/14 ]
test with newer build 2.5.1-1075:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_2.5.1-1075-rel.rpm

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_centos6_x86_64_2.5.1-1075-rel.rpm
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Didn't help, unfortunately. Is that still with -O3?
Comment by Phil Labee [ 14/Mar/14 ]
Still using -O3. There are extensive comments in the voltron Makefile warning against changing to -O2.
Comment by Phil Labee [ 14/Mar/14 ]
Did you try to build gperftools out of our repo?
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
The following turned out not to be true:

I got myself CentOS 6.4, and with its gcc and -O3 I was finally able to reproduce the issue.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
So I've got myself CentOS 6.4 and the _exact same compiler version_. When I build tcmalloc myself with all the right flags and replace the tcmalloc from the package, it works. Without replacing it, it crashes.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Phil, please clean ccache, reboot the builder host (to clear the page cache) and _then_ do another rebuild. Looking at the build logs it looks like ccache is being used, so my suspicion about RAM corruption is not fully excluded yet. And I don't have many other ideas.
Comment by Phil Labee [ 14/Mar/14 ]
cleared ccache and restarted centos-6-x86-builder, centos-6-x64-builder

started build 2.5.1-1076
Comment by Pavel Paulau [ 14/Mar/14 ]
2.5.1-1076 seems to be working, it warns about "SMALL MEMORY MODEL IS IN USE, PERFORMANCE MAY SUFFER" as well.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Maybe I'm doing something wrong, but it fails in the exact same way on my VM.
Comment by Pavel Paulau [ 14/Mar/14 ]
Sorry, it crashed eventually.
Comment by Aleksey Kondratenko [ 14/Mar/14 ]
Confirmed again. Everything is exactly the same as before. Build 1076 on CentOS 6.4 amd64 crashes very easily, both enterprise edition and community, and it doesn't crash if I replace tcmalloc with the one I've built from the exact same source, exact same flags and exact same compiler version.

Build 1071 doesn't crash. All of this 100% consistently.
Comment by Phil Labee [ 17/Mar/14 ]
possibly a difference in build environment

reference env is described in voltron README.md file

for centos-6 X64 (6.4 final) we use the defaults for these tools:


gcc-4.4.7-3.el6 ( 4.4.7-4 available)
gcc-c++-4.4.7-3 ( 4.4.7-4 available)
kernel-devel-2.6.32-358 ( 2.6.32-431.5.1 available)
openssl-devel-1.0.0-27.el6_4.2 ( 1.0.1e-16.el6_5.4 available)
rpm-build-4.8.0-32 ( 4.8.0-37 available)

these tools do not have an update:

scons-2.0.1-1
libtool-2.2.6-15.5

For all centos these specific versions are installed:

gcc, g++ 4.4, currently 4.4.7-3, 4.4.7-4 available
autoconf 2.65, currently 2.63-5 (no update available)
automake 1.11.1
libtool 2.4.2
Comment by Phil Labee [ 17/Mar/14 ]
downloaded gperftools-2.1.tar.gz from

    http://gperftools.googlecode.com/files/gperftools-2.1.tar.gz

and expanded into directory: gperftools-2.1

cloned https://github.com/couchbase/gperftools.git at commit:

    674fcd94a8a0a3595f64e13762ba3a6529e09926

into directory gperftools, and compared:

=> diff -r gperftools-2.1 gperftools
Only in gperftools: .git
Only in gperftools: autogen.sh
Only in gperftools/doc: pprof.see_also
Only in gperftools/src/windows: TODO
Only in gperftools/src/windows: google

Only in gperftools-2.1: Makefile.in
Only in gperftools-2.1: aclocal.m4
Only in gperftools-2.1: compile
Only in gperftools-2.1: config.guess
Only in gperftools-2.1: config.sub
Only in gperftools-2.1: configure
Only in gperftools-2.1: depcomp
Only in gperftools-2.1: install-sh
Only in gperftools-2.1: libtool
Only in gperftools-2.1: ltmain.sh
Only in gperftools-2.1/m4: libtool.m4
Only in gperftools-2.1/m4: ltoptions.m4
Only in gperftools-2.1/m4: ltsugar.m4
Only in gperftools-2.1/m4: ltversion.m4
Only in gperftools-2.1/m4: lt~obsolete.m4
Only in gperftools-2.1: missing
Only in gperftools-2.1/src: config.h.in
Only in gperftools-2.1: test-driver
Comment by Phil Labee [ 17/Mar/14 ]
Since the build files in your source are different from those in the production build, we can't really say we're using the same source.

Please build from our repo and re-try your test.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The difference is in the autotools products. I _cannot_ build using the same autotools that are present on the build machine unless I'm given access to that box.
Comment by Aleksey Kondratenko [ 17/Mar/14 ]
The _source_ is exactly the same.
Comment by Phil Labee [ 17/Mar/14 ]
I've given the versions of autotools to use, so you can bring your build environment in line with the production builds.

As a shortcut, I've submitted a request for a clone of the builder VM that you can experiment with.

See CBIT-1053
Comment by Wayne Siu [ 17/Mar/14 ]
The cloned builder is available. Info in CBIT-1053.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built tcmalloc from the exact copy in the builder directory.

Installed the package from inside the builder directory (build 1077). Verified that the problem exists. Stopped the service. Replaced tcmalloc. Observed that everything is fine.

Something in the environment is causing this, maybe unusual ldflags or something else. But _not_ the source.
Comment by Aleksey Kondratenko [ 18/Mar/14 ]
Built the full rpm package under the buildbot user, with the exact same make invocation as I see in the buildbot logs. The resultant package works. Weird indeed.
Comment by Phil Labee [ 18/Mar/14 ]
some differences between test build and production build:


1) In gperftools, production calls "make install-exec-am install-data-am" while the test calls "make install", which executes the extra step "all-am".

2) In ep-engine, production uses "make install" while the test uses "make".

3) The test was built as user "root" while production builds as user "buildbot", so PATH and other environment variables may be different.

In general it's hard to tell what steps were performed for the test build, as no output logfiles have been captured.
Comment by Wayne Siu [ 21/Mar/14 ]
Updated from Phil:
comment:
________________________________________

2.5.1-1082 was done without the tcmalloc flag: CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW

    http://review.couchbase.org/#/c/34755/


2.5.1-1083 was done with build step timeout increased from 60 minutes to 90

2.5.1-1084 was done with the tcmalloc flag restored:

    http://review.couchbase.org/#/c/34792/
Comment by Andrei Baranouski [ 23/Mar/14 ]
 2.5.1-1082 MB-10545 Vbucket map is not ready after 60 seconds
Comment by Meenakshi Goel [ 24/Mar/14 ]
A memcached crash with a segmentation fault was observed with build 2.5.1-1084-rel on Ubuntu 12.04 during auto-compaction tests.

Jenkins Link:
http://qa.sc.couchbase.com/view/2.5.1%20centos/job/centos_x64--00_02--compaction_tests-P0/56/consoleFull

root@jackfruit-s12206:/tmp# gdb /opt/couchbase/bin/memcached core.memcached.8276
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New LWP 8301]
[New LWP 8302]
[New LWP 8599]
[New LWP 8303]
[New LWP 8604]
[New LWP 8299]
[New LWP 8601]
[New LWP 8600]
[New LWP 8602]
[New LWP 8287]
[New LWP 8285]
[New LWP 8300]
[New LWP 8276]
[New LWP 8516]
[New LWP 8603]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
298 src/central_freelist.cc: No such file or directory.
(gdb) t a a bt

Thread 15 (Thread 0x7f3568039700 (LWP 8603)):
#0 0x00007f356f01b9fa in __lll_unlock_wake () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f018104 in _L_unlock_644 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f018063 in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c663d6 in Mutex::release (this=0x5f68250) at src/mutex.cc:94
#4 0x00007f3569c9691f in unlock (this=<optimized out>) at src/locks.hh:58
#5 ~LockHolder (this=<optimized out>, __in_chrg=<optimized out>) at src/locks.hh:41
#6 fireStateChange (to=<optimized out>, from=<optimized out>, this=<optimized out>) at src/warmup.cc:707
#7 transition (force=<optimized out>, to=<optimized out>, this=<optimized out>) at src/warmup.cc:685
#8 Warmup::initialize (this=<optimized out>) at src/warmup.cc:413
#9 0x00007f3569c97f75 in Warmup::step (this=0x5f68258, d=..., t=...) at src/warmup.cc:651
#10 0x00007f3569c2644a in Dispatcher::run (this=0x5e7f180) at src/dispatcher.cc:184
#11 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5f68258) at src/dispatcher.cc:28
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f356a705700 (LWP 8516)):
#0 0x00007f356ed0d83d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ed3b774 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f3569c65445 in updateStatsThread (arg=<optimized out>) at src/memory_tracker.cc:31
#3 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 13 (Thread 0x7f35703e8740 (LWP 8276)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e000, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e000, flags=<optimized out>) at event.c:1558
#3 0x000000000040c9e6 in main (argc=<optimized out>, argv=<optimized out>) at daemon/memcached.c:7996

Thread 12 (Thread 0x7f356c709700 (LWP 8300)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e280, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e280, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16814f8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f356e534700 (LWP 8285)):
#0 0x00007f356ed348bd in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356ecc8ff8 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f356ecca03e in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f356ecbe18a in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f356ecbd06b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f356e535b19 in fgets (__stream=<optimized out>, __n=<optimized out>, __s=<optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f356d918700 (LWP 8287)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
---Type <return> to continue, or q <return> to quit---

#1 0x00007f356db32176 in logger_thead_main (arg=<optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f3567037700 (LWP 8602)):
#0 SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:32
#1 0x00007f3569c6351c in lock (this=<optimized out>) at src/atomic.hh:282
#2 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#3 gimme (this=<optimized out>) at src/atomic.hh:396
#4 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#5 KVShard::getBucket (this=0x7a6e7c0, id=256) at src/kvshard.cc:58
#6 0x00007f3569c9231d in VBucketMap::getBucket (this=0x614a448, id=256) at src/vbucketmap.cc:40
#7 0x00007f3569c314ef in EventuallyPersistentStore::getVBucket (this=<optimized out>, vbid=256, wanted_state=<optimized out>) at src/ep.cc:475
#8 0x00007f3569c315f6 in EventuallyPersistentStore::firePendingVBucketOps (this=0x614a400) at src/ep.cc:488
#9 0x00007f3569c41bb1 in EventuallyPersistentEngine::notifyPendingConnections (this=0x5eb8a00) at src/ep_engine.cc:3474
#10 0x00007f3569c41d63 in EvpNotifyPendingConns (arg=0x5eb8a00) at src/ep_engine.cc:1182
#11 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3565834700 (LWP 8600)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7e1c0) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7e204) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f3566035700 (LWP 8601)):
#0 0x00007f356f0190fe in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3569c68f7d in wait (tv=..., this=<optimized out>) at src/syncobject.hh:57
#2 ExecutorThread::run (this=0x5e7fa40) at src/scheduler.cc:146
#3 0x00007f3569c6963d in launch_executor_thread (arg=0x5e7fa84) at src/scheduler.cc:36
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f356cf0a700 (LWP 8299)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e500, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e500, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x1681400) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3567838700 (LWP 8604)):
#0 0x00007f356f01b89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f356f017065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f356f016eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f3569c6635a in Mutex::acquire (this=0x5e7f890) at src/mutex.cc:79
#4 0x00007f3569c261f8 in lock (this=<optimized out>) at src/locks.hh:48
#5 LockHolder (m=..., this=<optimized out>) at src/locks.hh:26
---Type <return> to continue, or q <return> to quit---
#6 Dispatcher::run (this=0x5e7f880) at src/dispatcher.cc:138
#7 0x00007f3569c26c1d in launch_dispatcher_thread (arg=0x5e7f898) at src/dispatcher.cc:28
#8 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f356af06700 (LWP 8303)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8e780, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8e780, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16817e0) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f3565033700 (LWP 8599)):
#0 0x00007f356ed18267 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f3569c13997 in SpinLock::acquire (this=0x5ff7010) at src/atomic.cc:35
#2 0x00007f3569c63e57 in lock (this=<optimized out>) at src/atomic.hh:282
#3 SpinLockHolder (theLock=<optimized out>, this=<optimized out>) at src/atomic.hh:274
#4 gimme (this=<optimized out>) at src/atomic.hh:396
#5 RCPtr (other=..., this=<optimized out>) at src/atomic.hh:334
#6 KVShard::getVBucketsSortedByState (this=0x7a6e7c0) at src/kvshard.cc:75
#7 0x00007f3569c5d494 in Flusher::getNextVb (this=0x168d040) at src/flusher.cc:232
#8 0x00007f3569c5da0d in doFlush (this=<optimized out>) at src/flusher.cc:211
#9 Flusher::step (this=0x5ff7010, tid=21) at src/flusher.cc:152
#10 0x00007f3569c69034 in ExecutorThread::run (this=0x5e7e8c0) at src/scheduler.cc:159
#11 0x00007f3569c6963d in launch_executor_thread (arg=0x5ff7010) at src/scheduler.cc:36
#12 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f356b707700 (LWP 8302)):
#0 0x00007f356ed42353 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f356fdadf36 in epoll_dispatch (base=0x5e8ea00, tv=<optimized out>) at epoll.c:404
#2 0x00007f356fd99394 in event_base_loop (base=0x5e8ea00, flags=<optimized out>) at event.c:1558
#3 0x0000000000415584 in worker_libevent (arg=0x16816e8) at daemon/thread.c:301
#4 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f356bf08700 (LWP 8301)):
#0 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7f356f45d780) at src/central_freelist.cc:298
#1 0x00007f356f23ef19 in tcmalloc::CentralFreeList::FetchFromSpansSafe (this=0x7f356f45d780) at src/central_freelist.cc:283
#2 0x00007f356f23efb7 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f356f45d780, start=0x7f356bf07268, end=0x7f356bf07260, N=4) at src/central_freelist.cc:263
#3 0x00007f356f2430b5 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0xf5d298, cl=9, byte_size=128) at src/thread_cache.cc:160
#4 0x00007f356f239fa3 in Allocate (this=<optimized out>, cl=<optimized out>, size=<optimized out>) at src/thread_cache.h:364
#5 do_malloc_small (size=128, heap=<optimized out>) at src/tcmalloc.cc:1088
#6 do_malloc_no_errno (size=<optimized out>) at src/tcmalloc.cc:1095
#7 (anonymous namespace)::cpp_alloc (size=128, nothrow=<optimized out>) at src/tcmalloc.cc:1423
#8 0x00007f356f249538 in tc_new (size=139867476842368) at src/tcmalloc.cc:1601
#9 0x00007f3569c2523e in Dispatcher::schedule (this=0x5e7f880,
    callback=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>, outtid=0x6127930, priority=...,
    sleeptime=<optimized out>, isDaemon=true, mustComplete=false) at src/dispatcher.cc:243
#10 0x00007f3569c84c1a in TapConnNotifier::start (this=0x6127920) at src/tapconnmap.cc:66
---Type <return> to continue, or q <return> to quit---
#11 0x00007f3569c42362 in EventuallyPersistentEngine::initialize (this=0x5eb8a00, config=<optimized out>) at src/ep_engine.cc:1415
#12 0x00007f3569c42616 in EvpInitialize (handle=0x5eb8a00,
    config_str=0x7f356bf07993 "ht_size=3079;ht_locks=5;tap_noop_interval=20;max_txn_size=10000;max_size=1491075072;tap_keepalive=300;dbname=/opt/couchbase/var/lib/couchbase/data/default;allow_data_loss_during_shutdown=true;backend="...) at src/ep_engine.cc:126
#13 0x00007f356cf0f86a in create_bucket_UNLOCKED (e=<optimized out>, bucket_name=0x7f356bf07b80 "default", path=0x7f356bf07970 "/opt/couchbase/lib/memcached/ep.so", config=<optimized out>,
    e_out=<optimized out>, msg=0x7f356bf07560 "", msglen=1024) at bucket_engine.c:711
#14 0x00007f356cf0faac in handle_create_bucket (handle=<optimized out>, cookie=0x5e4bc80, request=<optimized out>, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2168
#15 0x00007f356cf10229 in bucket_unknown_command (handle=0x7f356d1171c0, cookie=0x5e4bc80, request=0x5e44000, response=0x40d520 <binary_response_handler>) at bucket_engine.c:2478
#16 0x0000000000412c35 in process_bin_unknown_packet (c=<optimized out>) at daemon/memcached.c:2911
#17 process_bin_packet (c=<optimized out>) at daemon/memcached.c:3238
#18 complete_nread_binary (c=<optimized out>) at daemon/memcached.c:3805
#19 complete_nread (c=<optimized out>) at daemon/memcached.c:3887
#20 conn_nread (c=0x5e4bc80) at daemon/memcached.c:5744
#21 0x0000000000406e45 in event_handler (fd=<optimized out>, which=<optimized out>, arg=0x5e4bc80) at daemon/memcached.c:6012
#22 0x00007f356fd9948c in event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1308
#23 event_process_active (base=<optimized out>) at event.c:1375
#24 event_base_loop (base=0x5e8ec80, flags=<optimized out>) at event.c:1572
#25 0x0000000000415584 in worker_libevent (arg=0x16815f0) at daemon/thread.c:301
#26 0x00007f356f014e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007f356ed41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#28 0x0000000000000000 in ?? ()
(gdb)
Comment by Aleksey Kondratenko [ 25/Mar/14 ]
Yesterday I took that consistently failing ubuntu build and played with it on my box.

It is exactly the same situation. Replacing libtcmalloc.so makes it work.

So I've spent the afternoon running what's in our actual package under a debugger.

I found several pieces of evidence that some object files linked into the libtcmalloc.so that we ship were built with -DTCMALLOC_SMALL_BUT_SLOW and some _were_ not.

That explains weird crashes.

I'm unable to explain how it's possible that our builders produced such .so files. Yet.

My gut feeling is that it might be:

* something caused by ccache

* perhaps not a full cleanup between builds

In order to verify that, I'm asking for the following (roughly as sketched below):

* do a build with ccache completely disabled, but with the -DTCMALLOC_SMALL_BUT_SLOW define still passed

* do git clean -xfd inside the gperftools checkout before doing the build
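
A sketch of those verification steps on the builder (the slave path is taken from the build log above; CCACHE_DISABLE is a standard ccache switch, but exactly how ccache is wired into the builder is an assumption):

    # Disable ccache for this build, keep the SMALL_BUT_SLOW define, and start from a clean tree.
    export CCACHE_DISABLE=1
    cd /home/buildbot/buildbot_slave/centos-6-x64-251-builder/build/build/gperftools
    git clean -xfd
    ./autogen.sh
    ./configure --prefix=/opt/couchbase CPPFLAGS=-DTCMALLOC_SMALL_BUT_SLOW --enable-minimal
    make && make install-exec-am install-data-am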

Comment by Phil Labee [ 29/Jul/14 ]
The failure was detected by

    http://qa.sc.couchbase.com/job/centos_x64--00_02--compaction_tests-P0/

Can I run this test on a 3.0.0 build to see if this bug still exists?
Comment by Meenakshi Goel [ 30/Jul/14 ]
Started a run with latest 3.0.0 build 1057.
http://qa.hq.northscale.net/job/centos_x64--44_01--auto_compaction_tests-P0/37/console

However, we haven't seen such crashes with compaction tests during 3.0.0 testing.
Comment by Meenakshi Goel [ 30/Jul/14 ]
Tests passed with 3.0.0-1057-rel.




[MB-11779] Memory underflow in updates-only scenario with 5 buckets Created: 21/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/view/lab/job/perf-dev/503/artifact/
Is this a Regression?: Yes

 Description   
Essentially re-opened MB-11661.

2 nodes, 5 buckets, 200K x 1KB docs per bucket (non-DGM), 2K updates per bucket.

Mon Jul 21 13:24:34.955935 PDT 3: (bucket-1) Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...

 Comments   
Comment by Sriram Ganesan [ 22/Jul/14 ]
Pavel

How often would you say this reproduces in your environment? I tried this locally a few times and didn't hit this.
Comment by Pavel Paulau [ 23/Jul/14 ]
Pretty much every time.

It usually takes >10 hours before the test encounters the GIGANTOR failure, but the slowly decreasing mem_used clearly indicates the issue well before that.
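
A simple way to watch that drift (a sketch; the path and options assume a default install, and bucket-1 is one of the buckets from this test):

    # Poll mem_used for bucket-1 once a minute; a steady decrease under a pure-update
    # workload is the symptom described above.
    while true; do
        /opt/couchbase/bin/cbstats localhost:11210 all -b bucket-1 | grep ' mem_used:'
        sleep 60
    done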
Comment by Pavel Paulau [ 26/Jul/14 ]
Just spotted it again in a different scenario, build 3.0.0-1024. Proof: https://s3.amazonaws.com/bugdb/jira/MB-11779/172.23.96.11.zip
Comment by Sriram Ganesan [ 28/Jul/14 ]
Pavel

Thanks for uploading those logs. I see a bunch of vbucket deletion messages in the test logs:

Fri Jul 25 07:33:16.745484 PDT 3: (bucket-10) Deletion of vbucket 1023 was completed.
Fri Jul 25 07:33:16.745619 PDT 3: (bucket-10) Deletion of vbucket 1022 was completed.
Fri Jul 25 07:33:16.745739 PDT 3: (bucket-10) Deletion of vbucket 1021 was completed.
Fri Jul 25 07:33:16.745887 PDT 3: (bucket-10) Deletion of vbucket 1020 was completed.
Fri Jul 25 07:33:16.746005 PDT 3: (bucket-10) Deletion of vbucket 1019 was completed.
Fri Jul 25 07:33:16.746177 PDT 3: (bucket-10) Deletion of vbucket 1018 was completed.

This seems to be the case for all the buckets, but the GIGANTOR message only shows up for 5 of them. Are these logs from the same test? Are you doing any forced shutdown of any of the buckets in your test? According to Chiyoung, there is a known issue in ep-engine at bucket shutdown time where the GIGANTOR message can manifest, affecting only the bucket that is shut down.
Comment by Sriram Ganesan [ 28/Jul/14 ]
Also, please confirm whether any rebalance operations were performed during the run covered by the logs uploaded on the 25th.
Comment by Pavel Paulau [ 28/Jul/14 ]
Sriram,

The logs are from a different test/setup (with 10 buckets).

There was only one rebalance event during initial cluster setup:

2014-07-25 07:33:07.970 ns_orchestrator:4:info:message(ns_1@172.23.96.11) - Starting rebalance, KeepNodes = ['ns_1@172.23.96.11','ns_1@172.23.96.12',
                                 'ns_1@172.23.96.13','ns_1@172.23.96.14'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
2014-07-25 07:33:07.995 ns_orchestrator:1:info:message(ns_1@172.23.96.11) - Rebalance completed successfully.

10 buckets were created after that:

2014-07-25 07:33:13.674 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:13.784 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:14.005 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.005 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.006 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.031 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:14.082 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:14.384 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.14' in 1 seconds.
2014-07-25 07:33:14.384 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.385 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.588 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.588 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.682 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.11' in 1 seconds.
2014-07-25 07:33:15.107 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.12' in 1 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.111 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.111 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.218 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.303 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.303 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.305 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.312 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.610 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.716 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.802 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.811 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.11' in 0 seconds.

Basically, bucket shutdown wasn't forced; all those operations are quite normal.

Also, from the logs I can see the underflow issue only in "bucket-10".
Comment by Pavel Paulau [ 30/Jul/14 ]
Hi Sriram,

I can start the test that reproduces the issue. Would a live cluster help?
Comment by Sriram Ganesan [ 30/Jul/14 ]
Pavel

I was planning on providing a toy build today, but I need to do more local testing in my environment before I can provide it. The current theory is that the root cause actually happens much earlier, messing up the accounting, and eventually leads to an underflow. I shall try to provide the build before noon today.




[MB-11843] {UPR} :: View Query timesout after a rebalance-in-out operation Created: 29/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Sarath Lakshman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 1:172.23.107.210
2:172.23.107.211
3:172.23.107.212
4:172.23.107.213
5:172.23.107.214
6:172.23.107.215

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Build 3.0.0-1046, CentOS 6.x

1. Create 5 node cluster
2. Create default bucket
3. Add 500 K Items
4. Create 3 ddocs with 3 views, start indexing and querying
5. Rebalance-in-out (out 2 nodes, in 1 node)
6. Query the views

Hit the following exception:

2014-07-29 05:38:01,573] - [rest_client:484] INFO - index query url: http://172.23.107.210:8092/default/_design/ddoc0/_view/view2?connectionTimeout=60000&full_set=true
ERROR
[('/usr/lib/python2.7/threading.py', 524, '__bootstrap', 'self.__bootstrap_inner()'), ('/usr/lib/python2.7/threading.py', 551, '__bootstrap_inner', 'self.run()'), ('./testrunner.py', 262, 'run', '**self._Thread__kwargs)'), ('/usr/lib/python2.7/unittest/runner.py', 151, 'run', 'test(result)'), ('/usr/lib/python2.7/unittest/case.py', 391, '__call__', 'return self.run(*args, **kwds)'), ('/usr/lib/python2.7/unittest/case.py', 327, 'run', 'testMethod()'), ('pytests/rebalance/rebalanceinout.py', 501, 'measure_time_index_during_rebalance', 'tasks[task].result(self.wait_timeout)'), ('lib/tasks/future.py', 162, 'result', 'self.set_exception(TimeoutError())'), ('lib/tasks/future.py', 264, 'set_exception', 'print traceback.extract_stack()')]

Looking at couchdb logs for 1 node

[couchdb:error,2014-07-29T10:24:25.367,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29759.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:25.572,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29762.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:25.821,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29770.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:27.556,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29821.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:27.685,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29827.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.105,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29840.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.575,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29852.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.805,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29857.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:28.985,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29875.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:29.143,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29878.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:29.393,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29881.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:29.533,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29894.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.040,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29910.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.177,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29913.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.333,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29918.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.524,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29925.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.687,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29937.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.802,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29945.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:30.994,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29956.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.160,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29960.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.325,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29963.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.455,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29966.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.556,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29969.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.719,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29972.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:24:31.831,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Cleanup process <0.29975.1> for set view `default`, replica (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:25:13.438,ns_1@10.3.121.63:<0.19295.1>:couch_log:error:44]Cleanup process <0.30517.1> for set view `default`, main (prod) group `_design/ddoc0`, died with reason: stopped
[couchdb:error,2014-07-29T10:25:46.471,ns_1@10.3.121.63:<0.19325.1>:couch_log:error:44]upr client (default, mapreduce_view: default _design/ddoc0 (prod/replica)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-29T10:25:46.471,ns_1@10.3.121.63:<0.19307.1>:couch_log:error:44]upr client (default, mapreduce_view: default _design/ddoc0 (prod/main)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-29T10:25:48.477,ns_1@10.3.121.63:<0.19313.1>:couch_log:error:44]Set view `default`, replica (prod) group `_design/ddoc0`, UPR process <0.19325.1> died with unexpected reason: vbucket_stream_not_found
[couchdb:error,2014-07-29T10:25:48.479,ns_1@10.3.121.63:<0.19295.1>:couch_log:error:44]Set view `default`, main (prod) group `_design/ddoc0`, UPR process <0.19307.1> died with unexpected reason: vbucket_stream_not_found


Test Case:: ./testrunner -i centos_x64_rebalance_in_out.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceinout.RebalanceInOutTests.measure_time_index_during_rebalance,items=500000,data_perc_add=50,nodes_init=5,nodes_in=1,nodes_out=2,num_ddocs=3,num_views=3,max_verify=50000,GROUP=IN_OUT;P1;FROM_2_0;PERFORMANCE
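
For manual reproduction, the view query from the description above can be issued directly against one of the listed nodes (a sketch; the 60-second client-side cap simply mirrors the connectionTimeout in the failing request):

    curl -s -m 60 'http://172.23.107.210:8092/default/_design/ddoc0/_view/view2?connectionTimeout=60000&full_set=true'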


 Comments   
Comment by Parag Agarwal [ 29/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11843/1046_log.tar.gz
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
This can go directly to view engine IMO.
Comment by Meenakshi Goel [ 30/Jul/14 ]
Observing a similar issue during view tests with the latest build 3.0.0-1057-rel:
http://qa.sc.couchbase.com/job/centos_x64--29_01--create_view_all-P1/129/consoleFull
Comment by Sarath Lakshman [ 30/Jul/14 ]
I have identified the problem and will be posting a fix by tomorrow.




[MB-11852] [System test] Memcached crashes during initial loading Created: 30/Jul/14  Updated: 30/Jul/14  Resolved: 30/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Ketaki Gangal Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1050-rel

Attachments: File stack_10.6.2.172.rtf    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
1. Set up a cluster, 2 buckets, 1 ddoc x 2 views.
2. Load 120M and 107M items on the buckets.

- Seeing a number of memcached crashes across a couple of nodes.
This is the first time the test has failed in the initial phases.

Attached is the stack trace from one of the nodes.
Attaching collect_info.

 

 Comments   
Comment by Mike Wiederhold [ 30/Jul/14 ]
http://review.couchbase.org/#/c/40017/
http://review.couchbase.org/#/c/40019/




[MB-11849] couch_view_index_updater crashes (Segmentation fault) during test with stale=false queries Created: 29/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1045

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = 2 x SSD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/leto/407/artifact/
Is this a Regression?: Yes

 Description   
Backtraces are attached.

There are many couchdb errors in logs as well.

 Comments   
Comment by Sriram Melkote [ 29/Jul/14 ]
Nimish, could it be the exit_thread_helper problem again, in a different part of the code?
Comment by Pavel Paulau [ 30/Jul/14 ]
I tried a more recent build (1057) and the segfault didn't happen (logs: http://ci.sc.couchbase.com/job/leto/412/artifact/).
Most likely the issue is intermittent.
Comment by Nimish Gupta [ 30/Jul/14 ]
Hi, this is not a problem with exit_thread_helper. It looks like the updater is not able to open a sort file, and there is no information in the logs regarding that sort file. Without a core dump, I don't think it is possible to know the exact reason the file failed to open.

Because the file failed to open, the error path was executed, and a minor bug there caused the crash. I have fixed that minor bug and the code is in review (http://review.couchbase.org/40052).

Pavel, could you please attach the core dump of couch_view_index_updater?


Comment by Pavel Paulau [ 30/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11849/core.couch_view_inde.27157.leto-s303.1406654330




[MB-10204] Time zone should be displayed in interface for Auto-Compaction Settings Created: 13/Feb/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Kirk Kirkconnell Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-07-22 at 4.54.58 PM.png     PNG File Screen Shot 2014-07-22 at 4.55.18 PM.png    

 Description   
On the Auto-Compaction settings page of the Admin Web UI, you can set a time of day, but nowhere does it say what time zone it is in. This needs to be definitive. Is this server time? Local to the user? UTC?

 Comments   
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
Let's add text saying that this is the server's time zone. We cannot portably display that time zone, however. But there's a very old story about being able to deal with time zones inside Erlang (for XDCR schedules).
Comment by Pavel Blagodov [ 18/Jul/14 ]
http://review.couchbase.org/39537
Comment by Aleksey Kondratenko [ 18/Jul/14 ]
The text in the patch wasn't exactly clear, so I've asked Anil to help. We also discussed some possibly larger changes to the compaction settings UI, and Anil promised feedback in the form of more concrete mockups.
Comment by Anil Kumar [ 22/Jul/14 ]
Pavel - I have attached the mockup for the Auto-Compaction settings page. Let me know if it's not clear.
Comment by Pavel Blagodov [ 30/Jul/14 ]
http://review.couchbase.org/39537




[MB-10921] Possibly file descriptor leak? Created: 22/Apr/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Trond Norbye Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File df_output.rtf     File du_output.rtf     File ls_delete.rtf     File lsof_10.6.2.164.rtf     File lsof_beam.rtf     File ls_output.rtf    
Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I ran df and du on that server and noticed inconsistent figures (full console log at the end of this email): du reports 68GB on /var/opt/couchbase, whereas df reports ~140GB of disk usage. The lsof command shows that there are several files which have been deleted but are still held open by beam.smp. Those files are in /var/opt/couchbase/.delete/, and their total size amounts to the “Other Data” (roughly 70GB).
 
I’ve never noticed that before, yet recently we started playing with CB views. I wonder if that can be related. Also note that at the time I did the investigation, there had been no activity on the cluster for several hours: no get/set on the buckets, and no compaction or indexing was ongoing.
Are you aware of this problem? What can we do about it?

beam.smp 55872 couchbase 19u REG 8,17 16136013537 4849671 /var/opt/couchbase/.delete/babe701b000ce862e58ca2edd1b8098b (deleted)
beam.smp 55872 couchbase 34u REG 8,17 1029765807 4849668 /var/opt/couchbase/.delete/5c6df85a423263523471f6e20d82ce07 (deleted)
beam.smp 55872 couchbase 51u REG 8,17 1063802330 4849728 /var/opt/couchbase/.delete/c2a11ea6f3e70f8d222ceae9ed482b13 (deleted)
beam.smp 55872 couchbase 55u REG 8,17 403075242 4849667 /var/opt/couchbase/.delete/6af0b53325bf4f2cd1df34b476ee4bb6 (deleted)
beam.smp 55872 couchbase 56r REG 8,17 403075242 4849667 /var/opt/couchbase/.delete/6af0b53325bf4f2cd1df34b476ee4bb6 (deleted)
beam.smp 55872 couchbase 57u REG 8,17 861075170 4849666 /var/opt/couchbase/.delete/72a08b8a613198cd3a340ae15690b7f1 (deleted)
beam.smp 55872 couchbase 58r REG 8,17 861075170 4849666 /var/opt/couchbase/.delete/72a08b8a613198cd3a340ae15690b7f1 (deleted)
beam.smp 55872 couchbase 59r REG 8,17 1029765807 4849668 /var/opt/couchbase/.delete/5c6df85a423263523471f6e20d82ce07 (deleted)
beam.smp 55872 couchbase 60r REG 8,17 896931996 4849672 /var/opt/couchbase/.delete/3b1b7aae4af60e9e720ad0f0d3c0182c (deleted)
beam.smp 55872 couchbase 63r REG 8,17 976476432 4849766 /var/opt/couchbase/.delete/6f5736b1ed9ba232084ee7f0aa5bd011 (deleted)
beam.smp 55872 couchbase 66u REG 8,17 18656904860 4849675 /var/opt/couchbase/.delete/fcaf4193727374b471c990a017a20800 (deleted)
beam.smp 55872 couchbase 67u REG 8,17 662227221 4849726 /var/opt/couchbase/.delete/4e7bbc192f20def5d99447b431591076 (deleted)
beam.smp 55872 couchbase 70u REG 8,17 896931996 4849672 /var/opt/couchbase/.delete/3b1b7aae4af60e9e720ad0f0d3c0182c (deleted)
beam.smp 55872 couchbase 74r REG 8,17 662227221 4849726 /var/opt/couchbase/.delete/4e7bbc192f20def5d99447b431591076 (deleted)
beam.smp 55872 couchbase 75u REG 8,17 1896522981 4849670 /var/opt/couchbase/.delete/3ce0c5999854691fe8e3dacc39fa20dd (deleted)
beam.smp 55872 couchbase 81u REG 8,17 976476432 4849766 /var/opt/couchbase/.delete/6f5736b1ed9ba232084ee7f0aa5bd011 (deleted)
beam.smp 55872 couchbase 82r REG 8,17 1063802330 4849728 /var/opt/couchbase/.delete/c2a11ea6f3e70f8d222ceae9ed482b13 (deleted)
beam.smp 55872 couchbase 83u REG 8,17 1263063280 4849673 /var/opt/couchbase/.delete/e06facd62f73b20505d2fdeab5f66faa (deleted)
beam.smp 55872 couchbase 85u REG 8,17 1000218613 4849767 /var/opt/couchbase/.delete/0c4fb6d5cd7d65a4bae915a4626ccc2b (deleted)
beam.smp 55872 couchbase 87r REG 8,17 1000218613 4849767 /var/opt/couchbase/.delete/0c4fb6d5cd7d65a4bae915a4626ccc2b (deleted)
beam.smp 55872 couchbase 90u REG 8,17 830450260 4849841 /var/opt/couchbase/.delete/7ac46b314e4e30f81cdf0cd664bb174a (deleted)
beam.smp 55872 couchbase 95r REG 8,17 1263063280 4849673 /var/opt/couchbase/.delete/e06facd62f73b20505d2fdeab5f66faa (deleted)
beam.smp 55872 couchbase 96r REG 8,17 1896522981 4849670 /var/opt/couchbase/.delete/3ce0c5999854691fe8e3dacc39fa20dd (deleted)
beam.smp 55872 couchbase 97u REG 8,17 1400132620 4849719 /var/opt/couchbase/.delete/e8eaade7b2ee5ba7a3115f712eba623e (deleted)
beam.smp 55872 couchbase 103r REG 8,17 16136013537 4849671 /var/opt/couchbase/.delete/babe701b000ce862e58ca2edd1b8098b (deleted)
beam.smp 55872 couchbase 104u REG 8,17 1254021993 4849695 /var/opt/couchbase/.delete/f77992cdae28194411b825fa52c560cd (deleted)
beam.smp 55872 couchbase 105r REG 8,17 1254021993 4849695 /var/opt/couchbase/.delete/f77992cdae28194411b825fa52c560cd (deleted)
beam.smp 55872 couchbase 106r REG 8,17 1400132620 4849719 /var/opt/couchbase/.delete/e8eaade7b2ee5ba7a3115f712eba623e (deleted)
beam.smp 55872 couchbase 108u REG 8,17 1371453421 4849793 /var/opt/couchbase/.delete/9b8b199920075102e52742c49233c57c (deleted)
beam.smp 55872 couchbase 109r REG 8,17 1371453421 4849793 /var/opt/couchbase/.delete/9b8b199920075102e52742c49233c57c (deleted)
beam.smp 55872 couchbase 111r REG 8,17 18656904860 4849675 /var/opt/couchbase/.delete/fcaf4193727374b471c990a017a20800 (deleted)
beam.smp 55872 couchbase 115u REG 8,17 16442158432 4849708 /var/opt/couchbase/.delete/2b70b084bd9d0a1790de9b3ee6c78f69 (deleted)
beam.smp 55872 couchbase 116r REG 8,17 16442158432 4849708 /var/opt/couchbase/.delete/2b70b084bd9d0a1790de9b3ee6c78f69 (deleted)
beam.smp 55872 couchbase 151r REG 8,17 830450260 4849841 /var/opt/couchbase/.delete/7ac46b314e4e30f81cdf0cd664bb174a (deleted)
beam.smp 55872 couchbase 181u REG 8,17 770014022 4849751 /var/opt/couchbase/.delete/d35ac74521ae4c1d455c60240e1c41e1 (deleted)
beam.smp 55872 couchbase 182r REG 8,17 770014022 4849751 /var/opt/couchbase/.delete/d35ac74521ae4c1d455c60240e1c41e1 (deleted)
beam.smp 55872 couchbase 184u REG 8,17 775017865 4849786 /var/opt/couchbase/.delete/2a85b841a373ee149290b0ec906aae55 (deleted)
beam.smp 55872 couchbase 185r REG 8,17 775017865 4849786 /var/opt/couchbase/.delete/2a85b841a373ee149290b0ec906aae55 (deleted)
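
For reference, a minimal sketch of totalling the space still held by deleted-but-open files on a node, assuming the standard lsof column layout shown above (SIZE/OFF is the 7th field):

#!/usr/bin/env python3
# Total the sizes of files that a process holds open but which have been
# unlinked (the "(deleted)" entries in the lsof output above).
import subprocess

def deleted_open_bytes(pid):
    out = subprocess.check_output(["lsof", "-nP", "-a", "-p", str(pid)])
    total = 0
    for line in out.splitlines():
        fields = line.split()
        # SIZE/OFF is assumed to be the 7th column of the default lsof layout.
        if b"(deleted)" in line and len(fields) >= 7 and fields[6].isdigit():
            total += int(fields[6])
    return total

if __name__ == "__main__":
    # 55872 is the beam.smp pid from the listing above.
    print("%.1f GB held by deleted-but-open files" % (deleted_open_bytes(55872) / 1e9))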

 Comments   
Comment by Volker Mische [ 22/Apr/14 ]
Filipe, could you have a look at this?

I also found in the bug tracker an issue that we needed to patch Erlang because of file descriptor leaks (CBD-753 [1]). Could it be related?

[1]: http://www.couchbase.com/issues/browse/CBD-753
Comment by Trond Norbye [ 22/Apr/14 ]
From the comments in that bug, it seems to have been a blocker for 2.1 testing, and this is 2.2...
Comment by Volker Mische [ 22/Apr/14 ]
Trond, it was Erlang that was patched, so the issue is independent of the Couchbase version and depends only on the Erlang version. Though I guess you use Erlang >= R16 anyway (which should have that patch).

Could you also try it with 2.5? Perhaps it has been fixed already.
Comment by Filipe Manana [ 22/Apr/14 ]
There's no useful information here to work on or conclude anything.

First of all, these may be database files. Both database and view files are renamed to UUIDs and moved to the .delete directory. And before 3.0, database compaction is orchestrated in Erlang land (rename + delete).

Second, we have had such leaks in the past: one caused by Erlang itself (which is why a patched R14/R15, or an unpatched R16, is needed) and others caused by CouchDB upstream code, which got fixed before 2.0 (and in Apache CouchDB 1.2) - geocouch is based on a copy of CouchDB's view engine that is much older than Apache CouchDB 1.x and hence suffers from the same leak issue (not closing files after compaction, amongst other cases).

Given that there are no concrete steps to reproduce this, and no one has observed it recently, I can't exclude the possibility that he's using geo/spatial views or running an unpatched Erlang.
Comment by Trond Norbye [ 22/Apr/14 ]
Volker: We ship a bundled Erlang in our releases, don't we?

Filipe: I forwarded the email to you, Volker and Alk on April 8th with all the information I had. We can ask for more information if that helps us pinpoint where it is.
Comment by Volker Mische [ 22/Apr/14 ]
Trond: I wasn't aware that the "I" isn't you :)

I would ask them to try it again on 2.5.1 and, if it's still there, for the steps to reproduce it.
Comment by Trond Norbye [ 22/Apr/14 ]
The customer has the system running. Are we sure there are no commands we can run on the Erlang side to gather more information on the current state?
Comment by Sriram Melkote [ 06/Jun/14 ]
Ketaki, can we please make sure in 3.0 tests:

(a) Number of open file descriptors does not keep growing
(b) The files in .delete directory get cleaned up eventually
(c) Disk space does not keep growing

If all of these hold in long-running tests, we can close this for 3.0
Comment by Sriram Melkote [ 16/Jun/14 ]
I'm going to close this as we've not seen evidence of fd leaks in R16 so far. If system tests encounter fd leak, please reopen.
Comment by Sriram Admin [ 01/Jul/14 ]
Reopening, as we're seeing it in another place (CBSE-1247), making it likely this is a product (and not an environment) issue
Comment by Sriram Melkote [ 03/Jul/14 ]
Nimish, can we:

(a) Give the files a specific naming pattern (e.g., view-{uuid} or something) so we can distinguish KV files from view files after they move to .delete
(b) Add a log message that counts the number of entries in, and the total size of, the .delete directory


This will help us see if we're accumulating .delete files during our system testing (a sketch of such a check follows below).
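
A minimal sketch of the kind of check asked for in (b), counting the entries and total size under the .delete directory (the data path is an assumption; both /opt/couchbase/var/lib/couchbase/data/.delete and /var/opt/couchbase/.delete appear in this ticket):

import os

# Assumed data path; adjust to /var/opt/couchbase/.delete for the other layout.
DELETE_DIR = "/opt/couchbase/var/lib/couchbase/data/.delete"

def delete_dir_stats(path=DELETE_DIR):
    """Return (entry_count, total_bytes) for everything under the .delete directory."""
    count, total = 0, 0
    for root, dirs, files in os.walk(path):
        count += len(dirs) + len(files)
        total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return count, total

if __name__ == "__main__":
    entries, size = delete_dir_stats()
    print("%d entries, %d bytes in %s" % (entries, size, DELETE_DIR))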
Comment by Nimish Gupta [ 09/Jul/14 ]
Hi, it could be that the views are not closing the fds before deleting the files. I have added a debug message to log the filename when we call delete (http://review.couchbase.org/#/c/39233/). Ketaki, could you please try to reproduce the issue with the latest build?
Comment by Sriram Melkote [ 09/Jul/14 ]
Ketaki, can you please attach logs from a system test run with rebalance etc, with the above change merged? This will help us understand how to fix the problem better.
Comment by Ketaki Gangal [ 09/Jul/14 ]
Yes, will do. The above changes are part of build 3.0.0-943-rel. I will update this bug with the next run of system tests.
Comment by Nimish Gupta [ 09/Jul/14 ]
Ketaki, please also attach the output of "ls -l" on the /var/opt/couchbase/.delete directory after running the test.
Comment by Ketaki Gangal [ 11/Jul/14 ]
- Run on build 3.0.0-943-rel.
- Attached information from the system cluster here

From all nodes:
1. ls -l from "/opt/couchbase/var/lib/couchbase/data/.delete/"*: it's zero across all nodes
2. df
3. du
4. lsof from one of the nodes, and below lsof of beam.smp - don't see anything unusual
5. Collect info from all the nodes; however, a cursory grep on "Deleting couch file " did not yield any output.

* Also, the disk usage pattern seems consistent and expected on the current runs.

This cluster has had the following rebalance related operations
- Rebalance In 1
- Swap Rebalance
- Rebalance Out

Logs from the cluster https://s3.amazonaws.com/bugdb/MB-10921/10921.tar
Comment by Ketaki Gangal [ 11/Jul/14 ]
I don't see anything unusual in the output below, however --

[root@centos-64-x64 fd]# pidof beam.smp
1330 1192 1134

[root@centos-64-x64 fd]# lsof -a -p 1330
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
beam.smp 1330 couchbase cwd DIR 253,0 4096 529444 /opt/couchbase/var/lib/couchbase
beam.smp 1330 couchbase rtd DIR 253,0 4096 2 /
beam.smp 1330 couchbase txt REG 253,0 54496893 523964 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp
beam.smp 1330 couchbase mem REG 253,0 165929 525131 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto_callback.so
beam.smp 1330 couchbase mem REG 253,0 88600 392512 /lib64/libz.so.1.2.3
beam.smp 1330 couchbase mem REG 253,0 1408384 1183516 /usr/lib64/libcrypto.so.0.9.8e
beam.smp 1330 couchbase mem REG 253,0 476608 525130 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto.so
beam.smp 1330 couchbase mem REG 253,0 135896 392504 /lib64/libtinfo.so.5.7
beam.smp 1330 couchbase mem REG 253,0 1916568 392462 /lib64/libc-2.12.so
beam.smp 1330 couchbase mem REG 253,0 43832 392490 /lib64/librt-2.12.so
beam.smp 1330 couchbase mem REG 253,0 142464 392486 /lib64/libpthread-2.12.so
beam.smp 1330 couchbase mem REG 253,0 140096 392500 /lib64/libncurses.so.5.7
beam.smp 1330 couchbase mem REG 253,0 595688 392470 /lib64/libm-2.12.so
beam.smp 1330 couchbase mem REG 253,0 19536 392468 /lib64/libdl-2.12.so
beam.smp 1330 couchbase mem REG 253,0 14584 392494 /lib64/libutil-2.12.so
beam.smp 1330 couchbase mem REG 253,0 154464 392455 /lib64/ld-2.12.so
beam.smp 1330 couchbase 0r FIFO 0,8 0t0 11761 pipe
beam.smp 1330 couchbase 1w FIFO 0,8 0t0 11760 pipe
beam.smp 1330 couchbase 2w FIFO 0,8 0t0 11760 pipe
beam.smp 1330 couchbase 3u REG 0,9 0 3696 anon_inode
beam.smp 1330 couchbase 4r FIFO 0,8 0t0 11828 pipe
beam.smp 1330 couchbase 5w FIFO 0,8 0t0 11828 pipe
beam.smp 1330 couchbase 6r FIFO 0,8 0t0 11829 pipe
beam.smp 1330 couchbase 7w FIFO 0,8 0t0 11829 pipe
beam.smp 1330 couchbase 8w REG 253,0 8637 529519 /opt/couchbase/var/lib/couchbase/logs/ssl_proxy.log
beam.smp 1330 couchbase 9u IPv4 12253 0t0 TCP *:11214 (LISTEN)
beam.smp 1330 couchbase 10u IPv4 12255 0t0 TCP localhost:11215 (LISTEN)
Comment by Nimish Gupta [ 14/Jul/14 ]
Ketaki, could you please run the test for a longer duration (e.g. 2-3 days) and check the number of files in the .delete directory? Please also upload the logs if you see files in the .delete directory.
Moreover, we could add a check in testrunner to inspect the .delete directory after all the tests finish.
Comment by Ketaki Gangal [ 14/Jul/14 ]
Hi Nimish,

This test runs for 2-3 days, and the logs are from this run.
From the logs / outputs attached above, I don't see any files in the /data/.delete directory.

Btw, the testrunner tests are much smaller, run a single rebalance per test, and have the cluster torn down afterwards, so I am not sure whether adding a ".delete" check there would be helpful.
Please advise.

Comment by Sriram Melkote [ 14/Jul/14 ]
Hi Ketaki - yes, please add the check to the long-running tests only. It could note the contents of the ".delete" directory before stopping the cluster.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

Could be related to R14 and might go away; added logging to investigate. Keeping it open to get confirmation from QE testing.
Comment by Ketaki Gangal [ 22/Jul/14 ]
Re-tested with build 3.0.0-973-rel. Not observing any files in the .delete dir.

Runtime : 72 hours+

Rebalance operations on this cluster include
1- Rebalance in 1 node
2. Swap Rebalance
3. Rebalance out 1 node
4. Rebalance in 2 nodes
5. Failover , add back

Logs attached include
1. Collect Info from the cluster https://s3.amazonaws.com/bugdb/MB-10921/10921-2.tar
2. .delete contents https://www.couchbase.com/issues/secure/attachment/21346/ls_delete.rtf
3. lsof beam https://www.couchbase.com/issues/secure/attachment/21347/lsof_beam.rtf

Comment by Sriram Melkote [ 22/Jul/14 ]
Nimish - it looks like Ketaki's runs show a clean result after a long test, so we have done due diligence and can assume that (probably R16) fixed the leak. If you agree, please close the bug. Thanks for your help, Ketaki.
Comment by Ketaki Gangal [ 22/Jul/14 ]
Seeing some entries in the .delete directory on the Jenkins run suite -- /opt/couchbase/var/lib/couchbase/data/.delete ...

Testrunner suite which runs into this is http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/93/consoleFull


From one of the nodes : 172.23.106.186

root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data# cd .delete/
root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data/.delete# ls -alth
total 32K
drwxrwx--- 6 couchbase couchbase 4.0K Jul 22 13:32 ..
drwxrwx--- 8 couchbase couchbase 4.0K Jul 22 13:24 .
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 13:09 315f1070a8e2e6413c8bf8177aa75f48
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 10:23 a659f4c60beb398c38b7a2563694f5fe
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 09:53 a8d29f6762f20ff56f6c542b19787d88
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 09:35 5f6ed028ea9afb7f7a1a09ae45fc3579
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 04:11 b8cd56417d94eba728a2a21e27c487b6
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 02:39 d47537360387b3fc6ba8d740acd61d34

root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data# du -h .delete/
4.0K .delete/315f1070a8e2e6413c8bf8177aa75f48
4.0K .delete/a8d29f6762f20ff56f6c542b19787d88
4.0K .delete/b8cd56417d94eba728a2a21e27c487b6
4.0K .delete/a659f4c60beb398c38b7a2563694f5fe
4.0K .delete/d47537360387b3fc6ba8d740acd61d34
4.0K .delete/5f6ed028ea9afb7f7a1a09ae45fc3579
28K .delete/

root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data# lsof -a -p 9355
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
beam.smp 9355 couchbase cwd DIR 252,0 4096 136610 /opt/couchbase/var/lib/couchbase
beam.smp 9355 couchbase rtd DIR 252,0 4096 2 /
beam.smp 9355 couchbase txt REG 252,0 52200651 135515 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp
beam.smp 9355 couchbase mem REG 252,0 182169 135218 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto_callback.so
beam.smp 9355 couchbase mem REG 252,0 92720 1177572 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
beam.smp 9355 couchbase mem REG 252,0 1612544 1183307 /lib/x86_64-linux-gnu/libcrypto.so.0.9.8
beam.smp 9355 couchbase mem REG 252,0 495975 135217 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto.so
beam.smp 9355 couchbase mem REG 252,0 159200 1177451 /lib/x86_64-linux-gnu/libtinfo.so.5.9
beam.smp 9355 couchbase mem REG 252,0 1811128 1177440 /lib/x86_64-linux-gnu/libc-2.15.so
beam.smp 9355 couchbase mem REG 252,0 31752 1177509 /lib/x86_64-linux-gnu/librt-2.15.so
beam.smp 9355 couchbase mem REG 252,0 135366 1177444 /lib/x86_64-linux-gnu/libpthread-2.15.so
beam.smp 9355 couchbase mem REG 252,0 133808 1177407 /lib/x86_64-linux-gnu/libncurses.so.5.9
beam.smp 9355 couchbase mem REG 252,0 1030512 1182376 /lib/x86_64-linux-gnu/libm-2.15.so
beam.smp 9355 couchbase mem REG 252,0 14768 1177438 /lib/x86_64-linux-gnu/libdl-2.15.so
beam.smp 9355 couchbase mem REG 252,0 10632 1182384 /lib/x86_64-linux-gnu/libutil-2.15.so
beam.smp 9355 couchbase mem REG 252,0 149280 1182383 /lib/x86_64-linux-gnu/ld-2.15.so
beam.smp 9355 couchbase 0r FIFO 0,8 0t0 706505 pipe
beam.smp 9355 couchbase 1w FIFO 0,8 0t0 706504 pipe
beam.smp 9355 couchbase 2w FIFO 0,8 0t0 706504 pipe
beam.smp 9355 couchbase 3u 0000 0,9 0 7808 anon_inode
beam.smp 9355 couchbase 4r FIFO 0,8 0t0 705201 pipe
beam.smp 9355 couchbase 5w FIFO 0,8 0t0 705201 pipe
beam.smp 9355 couchbase 6r FIFO 0,8 0t0 705202 pipe
beam.smp 9355 couchbase 7w FIFO 0,8 0t0 705202 pipe
beam.smp 9355 couchbase 8w REG 252,0 4320 139124 /opt/couchbase/var/lib/couchbase/logs/ssl_proxy.log
beam.smp 9355 couchbase 9u IPv4 702269 0t0 TCP *:11214 (LISTEN)
beam.smp 9355 couchbase 10u IPv4 705207 0t0 TCP localhost:11215 (LISTEN)


Logs https://s3.amazonaws.com/bugdb/MB-10921/10921-3.tar
-- note this is from a set of tests and not a single test in itself. I am not currently certain how reproducible this is, but I am seeing it across a couple of machines which are failing due to view queries taking a longer time to run.
Comment by Nimish Gupta [ 30/Jul/14 ]
These are directories in the .delete directory, not files. The directories also look to be empty, and in the lsof output there is no file marked as deleted. So this is not a critical case. I suggest we close this JIRA and reopen it if we see files in the .delete directories.




[MB-11692] On server group tab missing white arrow to the menu bar. Created: 11/Jul/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Trivial
Reporter: Patrick Varley Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: UI
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Server groups UI.png     PNG File Server Nodes UI.png    
Triage: Untriaged
Is this a Regression?: No

 Description   
See the screenshot attached.

I assume this is because there is no Server group in the menu bar.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Anil, Wayne, Parag, Tony .. July 17th
Comment by Pavel Blagodov [ 30/Jul/14 ]
http://review.couchbase.org/40053




[MB-11596] Alert popup reportedly prevent any use of UI (was: Couchbase GUI Console: Allow For Overriding Repetitive PopUp Alerts) Created: 30/Jun/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Morrie Schreibman Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: All OS

Issue Links:
Dependency
Gantt: start-finish
Flagged:
Impediment

 Description   
A customer recently encountered a problem in which disks filled to capacity on several nodes of the cluster simultaneously.

The problem was that various alerts popped up repeatedly and essentially locked the GUI: the customer was unable to execute any node removals and/or rebalances because of the non-stop stream of alerts.
The customer also expressed frustration because his email inbox was flooded with thousands of copies of the same alert.

Some mechanism is needed which will either automatically stop the same alert from popping up continuously or allow the customer to dismiss the alerts in order to regain access to the GUI.


 Comments   
Comment by Aleksey Kondratenko [ 30/Jun/14 ]
This description doesn't fit how alerts are supposed to work, so clearly something wasn't right. I need more details on what went wrong, with logs from all the nodes as usual.
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
The email issue has to be addressed separately. As a matter of fact, the alerts we have today are "poor man's" alerts: since the very beginning they have just been a quick way to deliver the most critical alerts to the user, and they more or less work.

We're aware of lots of problems with that, including the possibility of spamming users.

I'll take a look at this as part of 3.0. We're _not_ supposed to overwhelm the UI with alerts to the point of preventing the UI from being useful. That's a bug.
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
Updated the title to reflect the change of this from feature request to bug.
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
For now, people can disable alerts completely by commenting out the call to initAlertsSubscriber(); in app.js.
Comment by Anil Kumar [ 29/Jul/14 ]
Morrie - Does the workaround provided above work for you?




[MB-11697] Automated cbcollect fails silently after 30 seconds Created: 11/Jul/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: David Haikney Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 2014-07-11-164340_1920x1048_scrot.png     File MB-11697.mov    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
(Applies to 3.0 build 918)

* Go to Log->Collect Information
* Select "Upload to Couchbase"
* Set the host as s3.mistype.com
* Enter valid details for customer and ticket number.
* Click collect

The Collect button is disabled but the "collecting" dialogue doesn't appear. There are no error messages, and after approximately 30 seconds the Collect button is enabled again. If the collection process is unsuccessful, we need to give the user feedback about what has happened.

 Comments   
Comment by Aleksey Kondratenko [ 11/Jul/14 ]
Please note that next time we might bounce the ticket back if it doesn't have any diagnostics attached.
Comment by Aliaksey Artamonau [ 11/Jul/14 ]
Attached a screenshot of an error that I got.
Comment by Aliaksey Artamonau [ 11/Jul/14 ]
Please provide the cbcollect_info output.
Comment by David Haikney [ 14/Jul/14 ]
Apologies for not uploading immediately - I anticipated this would be straightforward to repro. cbcollect from my 1 node cluster is here:
http://customers.couchbase.com.s3.amazonaws.com/davidH/MB-11697.zip

Have attached a screen capture to this ticket so you can see the behaviour I was seeing.
Cheers, DH
Comment by Aliaksey Artamonau [ 14/Jul/14 ]
http://review.couchbase.org/#/c/39357/ fixes part of the problem, but there are still some things to be fixed on the UI side.
Comment by Pavel Blagodov [ 17/Jul/14 ]
What kind of things? Could you describe please?
Comment by Aliaksey Artamonau [ 17/Jul/14 ]
I thought that Alk would convey this information to you during your sync-ups. Assuming the request to /controller/startLogsCollection takes a noticeable amount of time, there are at least two problems:

- there's no spinner
- if you go to any other page and then go back to the log collection page, the "collect" button is not disabled as it's supposed to be
Comment by Aleksey Kondratenko [ 17/Jul/14 ]
Indeed, I haven't. We can discuss this tomorrow (my) morning.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Anil, Wayne, Parag, Tony .. July 17th
Comment by Pavel Blagodov [ 18/Jul/14 ]
http://review.couchbase.org/39531
Comment by Aleksey Kondratenko [ 18/Jul/14 ]
Still not fully fixed btw. Let's discuss on monday.
Comment by Pavel Blagodov [ 30/Jul/14 ]
I have checked it; it looks fully fixed.




[MB-11784] GUI incorrectly displays vBucket number in stats Created: 22/Jul/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Ian McCloy Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: customer, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 251VbucketDisplay.png     PNG File 3fixVbucketDisplay.png    
Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Many customers are confused and have complained that on the "General Bucket Analytics" / "VBUCKET RESOURCES" page, when listing the number of vBuckets, the GUI tries to convert the default value of 1024 vBuckets to kilobytes, so it displays as 1.02K vBuckets (screenshot attached). vBucket counts shouldn't be parsed this way and should always show the full number.

I've changed the JavaScript to detect vBucket values and not parse them (screenshot attached). Will amend with a Gerrit link when it's pushed to review.

 Comments   
Comment by Ian McCloy [ 22/Jul/14 ]
Code added to gerrit for review -> http://review.couchbase.org/#/c/39668/
Comment by Pavel Blagodov [ 24/Jul/14 ]
Hi Ian, here is clarification:
- kilo (or 'K') is a unit prefix in the metric system denoting multiplication by one thousand.
- kilobyte (or 'KB') is a multiple of the unit byte for digital information.
Comment by Ian McCloy [ 24/Jul/14 ]
Pavel, thank you for clearing that up for me. Can you please explain: when I see 1.02K vBuckets in the stats, is that 1022, 1023 or 1024 active vBuckets? I'm not clear when I look at the UI.
Comment by Pavel Blagodov [ 25/Jul/14 ]
1.02K is the expected value because the UI currently truncates all analytics stats to three significant digits. Of course we could increase this to four digits, but that would only work for K (not for M, for example).
Comment by David Haikney [ 25/Jul/14 ]
@Pavel - Yes, 1.02K is currently expected, but the desire here is to change the UI to show "1024" instead of "1.02K": fewer characters and more accuracy.
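
For illustration, a minimal sketch of the behaviour being asked for (the real formatting lives in the UI's JavaScript; the function and flag names here are made up): counts such as vBuckets are printed verbatim, while other stats keep their metric prefixes.

def format_stat(value, is_count=False):
    """Counts (e.g. vBuckets) are shown verbatim; other stats keep K/M/G prefixes."""
    if is_count:
        return str(value)  # 1024 stays "1024"
    for prefix, factor in (("G", 1e9), ("M", 1e6), ("K", 1e3)):
        if value >= factor:
            return "%.3g%s" % (value / factor, prefix)  # 1024 -> "1.02K"
    return str(value)

assert format_stat(1024) == "1.02K"
assert format_stat(1024, is_count=True) == "1024"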
Comment by Pavel Blagodov [ 30/Jul/14 ]
http://review.couchbase.org/39668




[MB-11788] [ui] getting incorrect ejection policy update warning when simply updating bucket's quota Created: 22/Jul/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
SUBJ.

1. Create bucket aaa

2. Open bucket aaa Edit dialog

3. Change quota and hit enter

4. Observe a modal popup warning that should not be there. I.e., we're not updating any property that would require a bucket restart, but we're still getting the warning.


 Comments   
Comment by Pavel Blagodov [ 30/Jul/14 ]
fixed in http://review.couchbase.org/40046




[MB-11119] Fix Bucket Analytics page when no buckets in cluster Created: 14/May/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Anil Kumar Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-05-14 at 1.58.49 PM.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Scenario
- No buckets in the cluster
- User clicks on one of the Nodes
- Lands on the Bucket Analytics page, which shows "Loading"

Expected behavior
- On the Bucket Analytics page, show the message "No buckets currently defined!" instead of "Loading"

 Comments   
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, Parag, Anil
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Anil, Wayne, Parag, Tony .. July 17th
Comment by Pavel Blagodov [ 30/Jul/14 ]
http://review.couchbase.org/38306




[MB-11616] Rebalance not available 'pending add rebalanace' while loading data Created: 02/Jul/14  Updated: 30/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Anil Kumar Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-07-02 at 11.23.16 AM.png    
Issue Links:
Relates to
relates to MB-11836 rebalance button grey out after add a... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
4 node cluster
scenario

1. Loading data into a single node (only one node at this time)
2. Data loading starts (progress is shown)
3. Add new nodes to cluster (3 more nodes added)
4. But Rebalance is not available until data loading is complete

Screenshot attached

Expected -

On Pending Rebalance - warning message "Rebalance is not available until data loading is completed"

 Comments   
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
You forgot to mention that it's sample data loading.

We explicitly prevent rebalance in this case because the sample loader is unable to deal with rebalance. There was a bug fixed for this as part of the 3.0 work.
Comment by Anil Kumar [ 07/Jul/14 ]
Got it. Please check the expected section; we need to show some message to the user.
Comment by Aleksey Kondratenko [ 07/Jul/14 ]
Please adapt the UI to show this message if a data loading task is present.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Anil, Wayne, Parag, Tony .. July 17th
Comment by Pavel Blagodov [ 18/Jul/14 ]
http://review.couchbase.org/39533
Comment by Aleksey Kondratenko [ 28/Jul/14 ]
Fix was merged but then reverted.
Comment by Pavel Blagodov [ 30/Jul/14 ]
http://review.couchbase.org/40051




[MB-11851] memcached -l argument generates server error Created: 30/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Adam Taylor Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: regression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
To increase the connection limit for port 11209, it is necessary to override the -l argument passed from ns-server to memcached, either via a wrapper script or via the diag/eval endpoint. In 3.0 this now generates an "Unexpected Server error". The -c flag for the overall connection limit works as expected.

Steps to reproduce:

Running:

curl --data "ns_config:set({node, node(), {memcached, extra_args}}, ['-l0.0.0.0:11210,0.0.0.0:11209:3000'])." -u Administrator:<PASSWORD> http://localhost:8091/diag/eval

Will generate an "Unexpected Server error" on 3.0.x but will return OK in 2.5.x




[MB-11841] BUILD BREAKAGE: memory_tracker.cc Created: 29/Jul/14  Updated: 30/Jul/14  Due: 29/Jul/14  Resolved: 30/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: .master, 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Chris Hillery Assignee: Chris Hillery
Resolution: Fixed Votes: 0
Labels: ep-engine
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
[ 42%] Building CXX object ep-engine/CMakeFiles/ep.dir/src/memory_tracker.cc.o
/buildbot/build_slave/centos-5-x64-master-builder/build/build/ep-engine/src/memory_tracker.cc: In member function ‘void MemoryTracker::getAllocatorStats(std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, long unsigned int, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char> >, long unsigned int> > >&)’:
/buildbot/build_slave/centos-5-x64-master-builder/build/build/ep-engine/src/memory_tracker.cc:120: error: ‘struct allocator_stats’ has no member named ‘free_mapped_size’
/buildbot/build_slave/centos-5-x64-master-builder/build/build/ep-engine/src/memory_tracker.cc:122: error: ‘struct allocator_stats’ has no member named ‘free_unmapped_size’
make[5]: *** [ep-engine/CMakeFiles/ep.dir/src/memory_tracker.cc.o] Error 1


This appears to be caused by commit f82a8d7fcfc807903ff0c15a93d63e52a16ccec9 .

It is failing on all Linux platforms, eg:

http://builds.hq.northscale.net:8010/builders/centos-5-x64-master-builder/builds/921/steps/couchbase-server%20make%20enterprise%20/logs/stdio

 Comments   
Comment by Dave Rigby [ 29/Jul/14 ]
Can you confirm this actually affects 3.0? I believe it should only affect master (a.k.a. 0.0)

There is a matching commit in memcached which is also needed, however it has only been committed onto memcached/3.0, and has not yet been merged to memcached/master.

The fix is to merge memcached/3.0 into memcached/master.
Comment by Chris Hillery [ 29/Jul/14 ]
You are correct, this affects master, not 3.0. Marking it as "3.0.1" for lack of a better tag.

Will you work with memcached to either get the change merged or revert your commit?
Comment by Dave Rigby [ 29/Jul/14 ]
Yep, on it with Trond.
Comment by Chris Hillery [ 29/Jul/14 ]
Reverting to "Test Blocker" - the master branch is also tested, albeit not as frequently as 3.0. Broken builds are always highest priority.
Comment by Dave Rigby [ 29/Jul/14 ]
Master should be fixed as of:

* 69d3f69 2014-07-29 | Merge remote-tracking branch 'membase/3.0.1' (HEAD, membase/master, m/master) [Trond Norbye]

Comment by Dave Rigby [ 30/Jul/14 ]
Build appears ok as of: http://builds.hq.northscale.net:8010/builders/centos-5-x64-master-builder/builds/923




[MB-11850] Search feature for new docs site Created: 30/Jul/14  Updated: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Dave Rigby Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Looking at the new docs [1], while breaking it into smaller sections *is* much better I note there doesn't appear to be a way to search them.

While the old docs didn't have a search either, it was less of an issue as I could, for example open the (one big page) admin guide, then use browser search (⌘F) to look for the particular option, string of feature I wanted.

Without this I'm afraid to say that overall the new docs are *less* usable than the old ones, as my only option is to scurry around in the different sections trying to hunt down the info I need.

I expect that a google custom search may be sufficient (as we have on the main website), but currently there is no such box on the beta site.

[1]: http://docs.couchbase.com/prebuilt/couchbase-manual-3.0/beta-intro.html

 Comments   
Comment by Matt Ingenthron [ 30/Jul/14 ]
I believe there is a plan to re-add both comments and search soon. I'll keep this in mind for other site changes upcoming.

The other thing we need to address is a way to get PDFs of these, since we have that again.




[MB-11846] Compiling breakdancer test case exceeds available memory Created: 29/Jul/14  Updated: 30/Jul/14  Due: 30/Jul/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Chris Hillery Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
1. With memcached change 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9, cc1 on ubuntu-1204 quickly uses up 100% of the available memory (4GB RAM, 512MB swap) and crashes with an internal error.

2. Without Trond's change, cc1 compiles fine and never takes up more than 12% memory, running on the same hardware.

 Comments   
Comment by Chris Hillery [ 29/Jul/14 ]
Ok, weird fact - on further investigation, it appears that this is NOT happening on the production build server, which is an identically-configured VM. It only appears to be happening on the commit validation server ci03. I'm going to temporarily disable that machine so the next make-simple-github-tap test runs on a different ci server and see if it is unique to ci03. If it is I will lower the priority of the bug. I'd still appreciate some help in understanding what's going on either way.
Comment by Trond Norbye [ 30/Jul/14 ]
Please verify that the two builders have the same patch level so that we're comparing apples with apples.

It does bring up another interesting topic: should our builders just use the compiler provided with the installation, or should we have a reference compiler that we use to build our code? It seems like a bad idea to have to support a ton of different compiler revisions (including the fact that they support different levels of C++11 that we have to work around).




[MB-11203] SSL-enabled memcached will hang when given a large buffer containing many pipelined requests Created: 24/May/14  Updated: 29/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Mark Nunberg Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Sample code which fills in a large number of pipelined requests and flushes them over a single buffer.

#include <libcouchbase/couchbase.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
static int remaining = 0;

static void
get_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_get_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
    }
    remaining--;
}

static void
stats_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_server_stat_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
    }

    if (resp->v.v0.server_endpoint == NULL) {
        fflush(stdout);
        --remaining;
    }
}

#define ITERCOUNT 5000
static int use_stats = 1;

static void
do_stat(lcb_t instance)
{
    lcb_CMDSTATS cmd;
    memset(&cmd, 0, sizeof(cmd));
    lcb_error_t err = lcb_stats3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

static void
do_get(lcb_t instance)
{
    lcb_error_t err;
    lcb_CMDGET cmd;
    memset(&cmd, 0, sizeof cmd);
    LCB_KREQ_SIMPLE(&cmd.key, "foo", 3);
    err = lcb_get3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

int main(void)
{
    lcb_t instance;
    lcb_error_t err;
    struct lcb_create_st cropt = { 0 };
    cropt.version = 2;
    char *mode = getenv("LCB_SSL_MODE");
    if (mode && *mode == '3') {
        cropt.v.v2.mchosts = "localhost:11996";
    } else {
        cropt.v.v2.mchosts = "localhost:12000";
    }
    mode = getenv("USE_STATS");
    if (mode && *mode != '\0') {
        use_stats = 1;
    } else {
        use_stats = 0;
    }
    err = lcb_create(&instance, &cropt);
    assert(err == LCB_SUCCESS);


    err = lcb_connect(instance);
    assert(err == LCB_SUCCESS);
    lcb_wait(instance);
    assert(err == LCB_SUCCESS);
    lcb_set_get_callback(instance, get_callback);
    lcb_set_stat_callback(instance, stats_callback);
    lcb_cntl_setu32(instance, LCB_CNTL_OP_TIMEOUT, 20000000);
    int nloops = 0;

    while (1) {
        unsigned ii;
        lcb_sched_enter(instance);
        for (ii = 0; ii < ITERCOUNT; ++ii) {
            if (use_stats) {
                do_stat(instance);
            } else {
                do_get(instance);
            }
            remaining++;
        }
        printf("Done Scheduling.. L=%d\n", nloops++);
        lcb_sched_leave(instance);
        lcb_wait(instance);
        assert(!remaining);
    }
    return 0;
}


 Comments   
Comment by Mark Nunberg [ 24/May/14 ]
http://review.couchbase.org/#/c/37537/
Comment by Mark Nunberg [ 07/Jul/14 ]
Trond, I'm assigning this to you because you might be able to delegate it to someone else. I can't see anything obvious in the diff since the original fix that would break it - of course my fix might not have fixed it completely but just made it work accidentally; or it may be flush-related.
Comment by Mark Nunberg [ 07/Jul/14 ]
Oh, and I found this on an older build of master (837) and on the latest checkout (currently 055b077f4d4135e39369d4c85a4f1b47ab644e22) -- I don't think anyone broke memcached; rather, the original fix was incomplete :(




[MB-10907] Missing UPR config :: UUID difference observed in (active vs replica) vbuckets after online upgrade 2.5.1 ==> 3.0.0-593 Created: 19/Apr/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Thuan Nguyen
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive online_upgrade_logs_2.5.1_3.0.0.tar.gz    
Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: 1. Install 2.5.1 on 10.6.2.144, 10.6.2.145
2. Add 10.6.2.145 to 10.6.2.144 and rebalance
3. Add default bucket with 1024 vbuckets to cluster
4. Add ~ 1000 items to the buckets
5. Online upgrade cluster to 3.0.0-593 with 10.6.2.146 as our extra node
6. Finally cluster has node 10.6.2.145, 10.6.2.146

Check the vbucket UUID for the active and replica vbuckets, after all replication is complete and the disk queue has drained.

Expectation: they should be the same according to UPR.

Actual result: they differ, as observed. Without an upgrade they are the same.

Note: in the build we tested, UPR was not turned on due to the missing COUCHBASE_REPL_TYPE = upr setting, and we were operating with TAP. This case will occur during upgrade, and we have to fix the config before upgrading.

Example of difference in UUID

On 10.6.2.145 where vb_9 is active
 vb_9:high_seqno: 14

 vb_9:purge_seqno: 0

 vb_9:uuid: 18881518640852

On 10.6.2.146 where vb_9 is replica
 vb_9:high_seqno: 14

 vb_9:purge_seqno: 0

 vb_9:uuid: 120602843033209





Is this a Regression?: No

 Comments   
Comment by Aliaksey Artamonau [ 22/Apr/14 ]
I can't find any evidence that there was even an attempt to upgrade replications to upr after rebalance. I assume that you forgot to set COUCHBASE_REPL_TYPE environment variable accordingly.
Comment by Parag Agarwal [ 22/Apr/14 ]
Isn't UPR switched on by default for a version like 3.0.0-593, or do we need to set it explicitly?
Comment by Parag Agarwal [ 22/Apr/14 ]
I will re-run the scenario after adding it to the config file and let you know the results. But if this is not on by default, I think we should turn it on, at least to avoid this scenario.
Comment by Aliaksey Artamonau [ 22/Apr/14 ]
You need to set it explicitly.
Comment by Aliaksey Artamonau [ 22/Apr/14 ]
Please also note that currently it's known that upgrade to UPR is broken: MB-10928.
Comment by Parag Agarwal [ 22/Apr/14 ]
Thanks for the update. I am going to change the scope of this bug, since the issue observed was the missing COUCHBASE_REPL_TYPE = upr setting.
Comment by Parag Agarwal [ 22/Apr/14 ]
Changing scope of the bug as per comments from the dev. We need to fix the config of our install package to switch on UPR by default.
Comment by Aleksey Kondratenko [ 08/May/14 ]
Given it's "retest after upr is default", I'm moving it from dev.

There is nothing dev needs to do with this right now
Comment by Parag Agarwal [ 08/May/14 ]
Re-assigned the bug to Tony since he handles functional upgrade tests. Thanks, Alk! Is there a bug open for this? Can you please add it here?
Comment by Anil Kumar [ 19/Jun/14 ]
Tony - Please update the ticket if you have tested with recent builds.
Comment by Wayne Siu [ 29/Jul/14 ]
Please re-open if this is an issue.




[MB-11847] Warmup stats ep_warmup_estimated_value_count returns "unknown" Created: 29/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Venu Uppalapati Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build 973

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Warmup stats ep_warmup_estimated_value_count returns "unknown"
ep_warmup_estimated_key_count: 99999
ep_warmup_estimated_value_count: unknown





[MB-11826] Don't send unnecessary config updates. Created: 26/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Brett Lawson Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
The memcached service should keep track of the last revId that it sent to a connected client, and avoid dispatching an identical config to that client during an NMV. Currently, when pipelining operations, a client may receive hundreds of NMV responses when only a single one is necessary. This won't prevent multiple NMVs across nodes, but it will prevent the bulk of the spam.

Note: An explicit request for the config via CCCP_GET_VBUCKET_CONFIG should always return the data regardless of revId info.
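
A minimal sketch of the dedup idea described above (per-connection tracking of the last revId pushed); this is illustrative only, not the actual memcached implementation:

class ConnectionConfigPusher:
    """Track the last cluster-config revision pushed on one connection so a burst
    of NOT_MY_VBUCKET responses carries the config body at most once per revId."""

    def __init__(self):
        self.last_rev_sent = None

    def config_for_nmv(self, current_rev, config_blob):
        # Suppress the payload when this connection already received this revId.
        if self.last_rev_sent == current_rev:
            return None
        self.last_rev_sent = current_rev
        return config_blob

    def config_for_explicit_get(self, current_rev, config_blob):
        # An explicit config request always returns the data, regardless of revId.
        self.last_rev_sent = current_rev
        return config_blob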

 Comments   
Comment by Jeff Morris [ 28/Jul/14 ]
This would alleviate the "spam" problem with configs during rebalance/swap/failover scenarios. +1.




[MB-10260] couchbase-cli rebalance reports incorrect rebalance status Created: 19/Feb/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, tools
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Larry Liu Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 1
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency

 Description   
While performing a rebalance from the command line, the command line reported success but the UI showed that the rebalance was still in progress.

The command used:
 /opt/couchbase/bin/couchbase-cli rebalance -c fcdstagea03 -u admin -p pass --server-remove=164.55.92.96

I was able to reproduce the issue under load.



 Comments   
Comment by Steve Yen [ 29/Jul/14 ]
Took a look at the code, where couchbase-cli just kicks off the rebalance and then polls /pools/default/rebalanceProgress in a sleepy loop until status is no longer "running"...

https://github.com/couchbase/couchbase-cli/blob/master/node.py#L847

That's so straightforward that I'm guessing the root issue is instead in ns-server. Perhaps the rebalanceProgress response information has changed? Or perhaps the rebalanceProgress output is sometimes wobbly during a server removal or while under load?

Related: I see from ns-server/doc/api.txt that rebalanceProgress is now deprecated, but (whew) menelaus_web.erl still has handle_rebalance_progress() code at the moment (since 2010!). Hopefully it stays for now in 3.0. :-) And: MB-11848 to remember to handle the deprecation.
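
For reference, a minimal sketch of that polling pattern (host, credentials and poll interval are placeholders; the response is assumed to carry {"status": "running"} while a rebalance is in flight):

import base64
import json
import time
import urllib.request

def wait_for_rebalance(host="127.0.0.1:8091", user="Administrator", password="password",
                       poll_interval=2.0):
    """Poll /pools/default/rebalanceProgress until ns_server no longer reports 'running'."""
    url = "http://%s/pools/default/rebalanceProgress" % host
    auth = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    while True:
        req = urllib.request.Request(url, headers={"Authorization": "Basic " + auth})
        status = json.loads(urllib.request.urlopen(req).read())
        if status.get("status") != "running":
            return status  # e.g. {"status": "none"} once the rebalance has stopped
        time.sleep(poll_interval)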
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th




[MB-10039] [Multi-instance testing]CBworkloadgen crashed while running server re-add rebalance during 20Million item insert run. Created: 27/Jan/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.4 64-bit
RAM:256GB
Dual 2.9GHz 8-core Xeon E5-2690 for 32 total cores (16 + hyperthreading)

Attachments: File splitcbworkcrash4.z01     Zip Archive splitcbworkcrash4.zip    
Triage: Triaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
The following message is seen when an instance is failed over and rebalance is clicked. The instance is not the one that cbworkloadgen is connected to.
2014-01-28 11:42:16,919: s1 refreshing sink map: http://localhost:9000
2014-01-28 11:42:16,927: s0 refreshing sink map: http://localhost:9000
2014-01-28 11:42:16,927: s2 refreshing sink map: http://localhost:9000
2014-01-28 11:45:10,348: s2 error: async operation: error: conn.send() exception: [Errno 32] Broken pipe on sink: http://localhost:9000(default@N/A-0)
2014-01-28 11:45:10,349: s0 error: async operation: error: conn.send() exception: [Errno 32] Broken pipe on sink: http://localhost:9000(default@N/A-2)
2014-01-28 11:45:10,351: s1 error: async operation: error: conn.send() exception: [Errno 32] Broken pipe on sink: http://localhost:9000(default@N/A-1)
error: conn.send() exception: [Errno 32] Broken pipe

Steps to reproduce:
0) The following steps were done to simulate system-level tests.
1) Run the following command to start inserting 20M items into the cluster.
user1@xxxx-1111 bin]$ ./cbworkloadgen -n localhost:9000 -i 200000000 -t 3 -u Administrator -p password
2) Create groups and hit rebalance.
3) While rebalance is in progress, hit failover on one node and then add it back. Click rebalance again.
4) At some point during the above steps, cbworkloadgen crashed.

2014-01-27 18:52:59,610: s0 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 436, key: pymc77480993, spec: http://localhost:9000, host:port: 172.23.100.18:12000
2014-01-27 18:52:59,610: s0 warning: received NOT_MY_VBUCKET; perhaps the cluster is/was rebalancing; vbucket_id: 439, key: pymc77480108, spec: http://localhost:9000, host:port: 172.23.100.18:12000
2014-01-27 18:52:59,613: s0 refreshing sink map: http://localhost:9000
2014-01-27 18:53:48,938: s0 error: recv exception: [Errno 104] Connection reset by peer
2014-01-27 18:53:48,938: s2 error: recv exception: [Errno 104] Connection reset by peer
2014-01-27 18:53:48,938: s1 error: recv exception: [Errno 104] Connection reset by peer
2014-01-27 18:53:48,938: s0 MCSink exception:
2014-01-27 18:53:48,938: s2 MCSink exception:
2014-01-27 18:53:48,939: s1 MCSink exception:
2014-01-27 18:53:48,939: s0 error: async operation: error: MCSink exception: on sink: http://localhost:9000(default@N/A-0)
2014-01-27 18:53:48,939: s2 error: async operation: error: MCSink exception: on sink: http://localhost:9000(default@N/A-2)
2014-01-27 18:53:48,940: s1 error: async operation: error: MCSink exception: on sink: http://localhost:9000(default@N/A-1)
error: MCSink exception:


 Comments   
Comment by Venu Uppalapati [ 27/Jan/14 ]
attached cbcollect_info from node of cbworkloadgen crash
Comment by Anil Kumar [ 04/Jun/14 ]
Triage - June 04 2014 Bin, Ashivinder, Venu, Tony
Comment by Anil Kumar [ 17/Jun/14 ]
Triage : June 17 2014 Anil, Bin, Wayne, Ashvinder, Tony
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Bin, Ashvinder, Wayne .. July 17th

Bin is going to provide us with the upper limit for the maximum number of items cbworkloadgen can accept. What's the default?

Assign this to documentation.

Comment by Steve Yen [ 29/Jul/14 ]
Spoke with Bin; the notes...

A while back, he wasn't able to reproduce this.

Also, back in July, Bin made a rebalance-related fix to cbworkloadgen (MB-7981), although that was just to smooth out pauses. But it might be related.

Also, there should be no limit on the "-i" param to cbworkloadgen, other than Python integer limits (2^63; MAX_INT). (Also, btw, the cmd-line above has a 200 million item insert instead of 20 million items.)
Comment by Anil Kumar [ 29/Jul/14 ]
Ruth - Need to document the limitation or limit.




[MB-10280] couchstore commit() may be incorrectly padding file size prior to first fsync, causing second fsync to do more work Created: 21/Feb/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Marty Schoch Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: May also affect earlier versions I observed this issue while reading couchstore codebase.

Attachments: Text File couchstore_header.patch    
Triage: Untriaged

 Description   
I discussed this with Aaron in irc and he agreed there may be a problem.

The code here:

https://github.com/couchbase/couchstore/blob/master/src/couch_db.cc#L189-L193

is attempting to extend the file size to account for the header which will subsequently be written after the first fsync (this allows the second fsync to be an fdatasync, which avoids writing metadata).

But the headers need to be aligned on 4096-byte block boundaries, and this calculation does not account for that. Further, the db_write_buf() method does not account for that either.

This means that in most cases, the file size will change again when we actually write the header.
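
A minimal sketch of the 4096-byte block alignment the padding calculation would need to account for (block size assumed from the description above):

COUCH_BLOCK_SIZE = 4096

def aligned_header_position(file_end):
    """Headers start on a block boundary, so pre-commit padding should extend the
    file up to the next 4096-byte multiple, not just past the header bytes."""
    if file_end % COUCH_BLOCK_SIZE == 0:
        return file_end
    return (file_end // COUCH_BLOCK_SIZE + 1) * COUCH_BLOCK_SIZE

# Example: padding that stops at byte 224 still leaves the header landing at
# 0x1000 (4096) once aligned, so the file size changes again on the header write.
assert aligned_header_position(224) == 4096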

 Comments   
Comment by Aleksey Kondratenko [ 13/Mar/14 ]
Interestingly, I actually remember testing this stuff and seeing no metadata commits in a blktrace of the second fdatasync. Weird.
Comment by Marty Schoch [ 26/Mar/14 ]
I did some more testing on this. Here is a couchstore file after 1 document was added and the changes were committed.

$ hexdump -C go-couchstore.couch
00000000 01 00 00 00 1d f0 29 2b 6b 0b 00 00 00 00 00 00 |......)+k.......|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 80 00 00 11 b2 db 45 1d 0f 38 7b 22 63 6f |........E..8{"co|
00000030 6e 74 65 6e 74 22 3a 31 32 33 7d 80 00 00 23 4a |ntent":123}...#J|
00000040 71 35 c4 22 2c 01 00 50 00 00 17 64 6f 63 2d 30 |q5.",..P...doc-0|
00000050 00 01 01 44 01 00 00 00 19 00 00 00 00 00 22 00 |...D..........".|
00000060 00 00 00 00 01 80 80 00 00 23 5d a9 ba 13 23 18 |.........#]...#.|
00000070 01 00 60 00 00 17 00 01 01 14 01 00 50 00 00 19 |..`.........P...|
00000080 01 0a 34 00 22 00 00 00 00 00 01 80 64 6f 63 2d |..4.".......doc-|
00000090 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |0...............|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000d0 00 00 00 00 00 00 00 80 00 00 01 d2 02 ef 8d 00 |................|
000000e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 01 00 00 00 4a 9c 85 5e c1 0b 00 00 00 00 00 01 |....J..^........|
00001010 00 00 00 00 00 00 00 00 00 00 00 00 00 11 00 1c |................|
00001020 00 00 00 00 00 00 00 66 00 00 00 00 00 2b 00 00 |.......f.....+..|
00001030 00 00 01 00 00 00 00 00 3b 00 00 00 00 00 2b 00 |........;.....+.|
00001040 00 00 00 01 00 00 00 00 00 00 00 00 00 00 19 |...............|
0000104f

The header is at 0x1000, the bySeq tree is at 0x66, the byId tree is at 0x3b, the document body is at 0x22.

There is a chunk written at 0xd7, but nothing points to it. Its structure is that of a 1-byte chunk, consistent with what we write when trying to extend the file. Only it's at the wrong spot, so it would have failed to extend the file to the correct length.

This was done with an older version of couchstore than we currently use, so to see if it still happens I looked at a file created by Couchbase Server 2.5.

Here is the end of one of the beer-sample vbucket files:

0000b100 00 00 00 00 00 00 00 00 00 00 80 00 00 01 d2 02 |................|
0000b110 ef 8d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0000b120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0000c000 01 00 00 00 56 9e 73 93 26 0b 00 00 00 00 00 09 |....V.s.&.......|
0000c010 00 00 00 00 00 00 00 00 00 00 00 00 00 11 00 1c |................|
0000c020 00 0c 00 00 00 00 a3 dc 00 00 00 00 02 65 00 00 |.............e..|
0000c030 00 00 09 00 00 00 00 a1 7c 00 00 00 00 02 60 00 |........|.....`.|
0000c040 00 00 00 09 00 00 00 00 00 00 00 00 00 14 ca 00 |................|
0000c050 00 00 00 b0 5b 00 00 00 00 00 5d |....[.....]|
0000c05b

Without studying it too carefully, we see at 0xb10a the start of what appears to be one of these 1 byte chunks. This is a pretty strong indication that this behavior is still happening.

I modified the code to print the end file position before and after writing the correct header. If the code is working correctly, we would expect the same file position in both. Here we see:

done doing file extend pos: 224
done writing actual file header: 4175

I then tested again with my patch.

done doing file extend pos: 4175
done writing actual file header: 4175

Patch attached.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Chiyoung - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.




[MB-10292] [windows] assertion failure in test_file_sort Created: 24/Feb/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
assertion on line 263 fails: assert(ret == FILE_SORTER_SUCCESS);

ret == FILE_SORTER_ERROR_DELETE_FILE

 Comments   
Comment by Trond Norbye [ 27/Feb/14 ]
I've disabled the test for win32 with http://review.couchbase.org/#/c/33985/ to allow us to find other regressions..
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-7408] Incorrect return code from deletion of a nonexistent design document Created: 13/Dec/12  Updated: 29/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: RESTful-APIs
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Trond Norbye Assignee: Aliaksey Artamonau
Resolution: Unresolved Votes: 0
Labels: ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
I'm getting 200 no matter what name I put on the design document I want to delete, instead of the 404 I would expect.

trond@ok:1363> curl -v -X GET -H 'Content-Type: application/json' 'http://192.168.0.61:8092/default/_design/bug' ~/compile/couchbase/sdk/php
* About to connect() to 192.168.0.61 port 8092 (#0)
* Trying 192.168.0.61...
* connected
* Connected to 192.168.0.61 (192.168.0.61) port 8092 (#0)
> GET /default/_design/bug HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: 192.168.0.61:8092
> Accept: */*
> Content-Type: application/json
>
< HTTP/1.1 404 Object Not Found
< Server: MochiWeb/1.0 (Any of you quaids got a smint?)
< Date: Thu, 13 Dec 2012 08:12:42 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 41
< Cache-Control: must-revalidate
<
{"error":"not_found","reason":"missing"}
* Connection #0 to host 192.168.0.61 left intact
* Closing connection #0
trond@ok:1364> curl -v -X DELETE -H 'Content-Type: application/json' 'http://192.168.0.61:8092/default/_design/bug' ~/compile/couchbase/sdk/php
* About to connect() to 192.168.0.61 port 8092 (#0)
* Trying 192.168.0.61...
* connected
* Connected to 192.168.0.61 (192.168.0.61) port 8092 (#0)
> DELETE /default/_design/bug HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: 192.168.0.61:8092
> Accept: */*
> Content-Type: application/json
>
< HTTP/1.1 200 OK
< Server: MochiWeb/1.0 (Any of you quaids got a smint?)
< Date: Thu, 13 Dec 2012 08:12:50 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 31
< Cache-Control: must-revalidate
<
{"ok":true,"id":"_design/bug"}
* Connection #0 to host 192.168.0.61 left intact
* Closing connection #0


 Comments   
Comment by Aleksey Kondratenko [ 13/Dec/12 ]
couch style document editing api is not supported, not official, known broken and will be killed.
Comment by Aleksey Kondratenko [ 13/Dec/12 ]
Ah. Sorry, that's a design doc, in which case it is public and supported.

Thanks for finding and filing this
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Alk, Consider fixing this for 3.0.
Comment by Anil Kumar [ 10/Jun/14 ]
Triage - June 10 2014 Anil




[MB-11413] [OS X] Total RAM incorrectly reported when memory is wired Created: 12/Jun/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: OS X 10.9 (Mavericks) 10.9.3

Attachments: PNG File Screen Shot 2014-06-12 at 16.24.51.png    
Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
Running 3.0.0-802 [1], the amount of system RAM is reported incorrectly in the GUI.

For example on my 16GB laptop, it is reported as per the attached screenshot: 4173 MB. This appears to be related / caused by the amount of wired memory in OS X (see comments). I had a few VMs running, and this appears to affect how much memory Erlang believes the system has.

This results in not being able to actually create as many buckets / allocate as much RAM as expected.

[1]: http://builder.hq.couchbase.com/manifest/couchbase-server-enterprise_x86_64_3.0.0-802-rel.zip

 Comments   
Comment by Dave Rigby [ 12/Jun/14 ]
Dropping from Critical -> Major. This appears to be related / caused by the amount of wired memory in OS X. I had a few VMs running, and this appears to affect how much memory Erlang believes the system has. I can kinda understand this behaviour - after all it's not possible to allocate / swap this and so it wouldn't be available to other apps.

However this could be confusing to users, and doesn't match our documentation, so leaving open for further comment...
Comment by Aleksey Kondratenko [ 12/Jun/14 ]
My team lacks the expertise to handle this very environment-specific issue. Per our previous agreement, Ravi owns the OSX platform until we find (ironically, among dozens of osx fans in this company) a proper person.

This number is taken from erlang's memsup:get_memory_data call, which does whatever os-specific tricks are necessary to find this out.
Comment by Dave Rigby [ 12/Jun/14 ]
Having looked in more detail at memsup:get_memory_data [1], the value it returns (total_memory) is documented as:

    "The total amount of memory available to the Erlang emulator, allocated and free. May or may not be equal to the amount of memory configured in the system."

So it's deliberately vague on whether it returns the actual RAM size or just the amount of RAM "available"; hence it justifiably returns "total - wired" on OS X. It also appears this calculation has been in place since at least R13B03, so it's nothing new.

Arguably it can be dropped in priority if no-one has had an issue with it so far.

[1]: http://erlang.org/doc/man/memsup.html




[MB-11761] Modifying remote cluster settings from non-ssl to ssl immediately after upgrade from 2.5.1-3.0.0 failed but passed after waiting for 2 minutes. Created: 17/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Upgrade from 2.5.1-1083 - 3.0.0-973

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-11761/b04bb6e3/10.3.3.126-diag.txt.gz
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-11761/e725c091/10.3.3.126-7172014-1030-diag.zip
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-11761/69f350e4/10.3.5.11-7172014-1032-diag.zip
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-11761/e0d1910f/10.3.5.11-diag.txt.gz

[Destination]
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-11761/204af908/10.3.5.60-7172014-1035-diag.zip
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-11761/feb4ca2f/10.3.5.60-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-11761/0c778e33/10.3.5.61-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-11761/58111486/10.3.5.61-7172014-1033-diag.zip
Is this a Regression?: Unknown

 Description   
Set up non-ssl XDCR between Source and Destination with version 2.5.1-1083.
After upgrading the remote clusters to 3.0.0-973, changing the settings to SSL failed immediately.

[2014-07-17 10:28:29,416] - [rest_client:747] ERROR - http://10.3.5.61:8091/pools/default/remoteClusters/cluster0 error 400 reason: unknown {"_":"Error {{tls_alert,\"unknown ca\"},\n [{lhttpc_client,send_request,1,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,199}]},\n {lhttpc_client,execute,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,151}]},\n {lhttpc_client,request,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,83}]}]} happened during REST call get to http://10.3.3.126:18091/pools."}
[2014-07-17 10:28:29,416] - [rest_client:821] ERROR - /remoteCluster failed : status:False,content:{"_":"Error {{tls_alert,\"unknown ca\"},\n [{lhttpc_client,send_request,1,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,199}]},\n {lhttpc_client,execute,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,151}]},\n {lhttpc_client,request,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,83}]}]} happened during REST call get to http://10.3.3.126:18091/pools."}
ERROR
[2014-07-17 10:28:29,418] - [xdcrbasetests:158] WARNING - CLEANUP WAS SKIPPED

======================================================================
ERROR: offline_cluster_upgrade (xdcr.upgradeXDCR.UpgradeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/xdcr/upgradeXDCR.py", line 195, in offline_cluster_upgrade
    self._modify_clusters(None, self.dest_master, remote_cluster['name'], self.src_master, require_encryption=1)
  File "pytests/xdcr/xdcrbasetests.py", line 1123, in _modify_clusters
    demandEncryption=require_encryption, certificate=certificate)
  File "lib/membase/api/rest_client.py", line 835, in modify_remote_cluster
    self.__remote_clusters(api, 'modify', remoteIp, remotePort, username, password, name, demandEncryption, certificate)
  File "lib/membase/api/rest_client.py", line 822, in __remote_clusters
    raise Exception("remoteCluster API '{0} remote cluster' failed".format(op))
Exception: remoteCluster API 'modify remote cluster' failed

----------------------------------------------------------------------
Ran 1 test in 614.847s

[Jenkins]
http://qa.hq.northscale.net/job/centos_x64--104_01--XDCR_upgrade-P1/22/consoleFull

[Test]
./testrunner -i centos_x64--104_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.offline_cluster_upgrade,initial_version=2.5.1-1083-rel,replication_type=xmem,bucket_topology=default:1>2;bucket0:1><2,upgrade_nodes=dest;src,use_encryption_after_upgrade=src;dest

Workaround: I put a wait of 120 seconds after the upgrade and before changing the XDCR settings, and the test passed.

Question: Is this expected behavior after upgrading from 2.5.1-1083 -> 3.0.0, since the same test passes with no additional wait when upgrading from 2.0 -> 3.0 or 2.5.0 -> 3.0?


The issue occurs only for the upgrade from 2.5.1-1083-rel -> 3.0.0-973-rel.

 Comments   
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
Have you tried waiting (much) less than 2 minutes ?
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
I have not tried it with less than 2 minutes, I will try with 1 minute or lesser and update you about the test result.
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Test failed with 1 minute wait, but passed with 90 seconds wait.

Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Alk - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.




[MB-9222] standalone moxi-server -- no core on RHEL5 Created: 04/Oct/13  Updated: 29/Jul/14  Due: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: build, moxi
Affects Version/s: 1.8.1, 2.1.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Alexander Petrossian (PAF) Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: moxi
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
The moxi init.d script contains the
{code}
ulimit -c unlimited
{code}
line, which is supposed to allow core dumps.

But then it uses the OS /etc/.../functions "daemon" function,
which overrides this ulimit.

One needs to use the
{code}
DAEMON_COREFILE_LIMIT=unlimited
{code}
environment variable instead, which will be handled by the "daemon" function to do "ulimit -c unlimited".

 Comments   
Comment by Alexander Petrossian (PAF) [ 04/Oct/13 ]
Once we did that, we found out that moxi does chdir("/").
We found in the sources that one can use the "-r" command line switch to prevent the "cd /" from happening.
Also, the "/var/run" folder, which is chdir'd to prior to the "daemon" command, is no good anyway; it cannot be written to by the "moxi" user.

I feel that being able to write cores is very important.
I agree that it may not be a good idea to enable that by default.

But now this is broken in 3 places, which is not good.

We suggest:
# cd /tmp (instead of cd /var/run) -- usually a safe place for any user to write to, and it exists on all systems.
# document the -r command line switch (currently not documented in "moxi -h")
# add DAEMON_COREFILE_LIMIT before calling the "daemon" function
Comment by Alexander Petrossian (PAF) [ 04/Oct/13 ]
regarding the default... we see there is core (and .dump) here:
[root@spms-lbas ~]# ls -la /opt/membase/var/lib/membase
-rw------- 1 membase membase 851615744 Feb 19 2013 core.12674
-rw-r----- 1 membase membase 12285899 Oct 4 17:45 erl_crash.dump

so maybe it is a good idea to enable it by default?

[root@spms-lbas ~]# file /opt/membase/var/lib/membase/core.12674
/opt/membase/var/lib/membase/core.12674: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'memcached'
[root@spms-lbas ~]#
Comment by Matt Ingenthron [ 20/Dec/13 ]
Steve: who is the right person to look at this these days?
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Iryna,

can you confirm if this is still happening in 3.0?
If it is, pls assign to Steve Y. Otherwise, resolve and close.
Thanks.
Comment by Steve Yen [ 25/Jul/14 ]
(scrubbing through ancient moxi bugs)

Hi Chris,
Looks like Alexander Petrossian has found the issue and the fix with the DAEMON_COREFILE_LIMIT env variable.

Can you incorporate this into moxi-init.d ?

Thanks,
Steve
Comment by Chris Hillery [ 25/Jul/14 ]
For prioritization purposes: Are we actually producing a standalone moxi product anymore? I'm unaware of any builds for it, so does it make sense to tag this bug "3.0" or indeed fix it at all?
Comment by Steve Yen [ 25/Jul/14 ]
Hi Chris,
We are indeed still supposed to provide a standalone moxi build (unless I'm out of date on news).

Priority-wise, IMHO, it's not the highest (but that's just my opinion), as I believe folks can still get by with the standalone moxi from 2.5.1. That is, moxi hasn't changed very much functionally -- although Trond & team did a bunch of rewriting / refactoring to make it easier to build and develop (cmake, etc).

Cheers,
Steve




[MB-11807] couchbase server failed to start in ubuntu when upgrading from 2.0 to 3.0 if it could not find the database Created: 23/Jul/14  Updated: 29/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 2.0 on one ubuntu 12.04 64-bit node
Initialize it with a custom data and index path (/tmp/data and /tmp/index)
Create default bucket
Load 1K items to the bucket
Shut down couchbase server
Remove all files under /tmp/data/ and /tmp/index
Upgrade couchbase server to 3.0.0-995
Couchbase server failed to start because it could not find the database.
Manually start couchbase server. Couchbase server starts normally with no items in the bucket, as expected.

The point here is that couchbase server should start even if it cannot find the database files.

It may be related to bug MB-7705

 Comments   
Comment by Bin Cui [ 28/Jul/14 ]
First, I really don't think this is a valid test case. If the data directory and index directory are gone, config.dat becomes obsolete and the upgrade script won't be able to identify the old directories and retrieve host information for the upgrade to proceed. Logically, it really doesn't matter whether you cannot proceed and finish the upgrade but start as a brand new node, or whether you simply fail the upgrade process. Because of the data loss, it is equivalent to installing a new setup, i.e. a node failover.
Comment by Thuan Nguyen [ 28/Jul/14 ]
In this case, it does not matter whether data is on the node or not, or whether it is an upgrade or not: couchbase server does not start after the installation is done.
Comment by Anil Kumar [ 28/Jul/14 ]
Bin - Discussed with Tony. Looks like we have an inconsistency in the way this particular scenario works on each platform. In this scenario, on CentOS Couchbase Server starts back up with an error message that it cannot find the data files, whereas on Ubuntu Couchbase Server crashes and doesn't start.

Comment by Anil Kumar [ 29/Jul/14 ]
Bin - I agree with you. The upgrade should fail with the error message "Upgrade failed. Cannot find the data (path) and index (path) files". And this should be the same across all the platforms.
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Bin - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.




[MB-4408] Debian package does not protect config files Created: 05/Nov/11  Updated: 29/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: build, installer, moxi
Affects Version/s: 1.7.1.1, 1.7.2
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ben Beuchler Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 8.04 and 10.04

Triage: Untriaged

 Description   
The Debian packages for both membase-server and moxi-server do include the configuration files (/opt/membase/etc and /opt/moxi/etc) in the "conffiles" list in the package. This means that dpkg does not treat them as configuration files and replaces them with the new version when upgrading. This destroys the previous config.

If the configuration files were installed under /etc, Debian would automatically handle them appropriately. Alternately, listing the files in the "conffiles" file when building the package should accomplish the same thing.

 Comments   
Comment by Ben Beuchler [ 05/Nov/11 ]
Of course I meant "The Debian packages for both membase-server and moxi-server do **not** include the configuration files"
Comment by Farshid Ghods (Inactive) [ 05/Nov/11 ]
not sure if membase and moxi installation are supported on the same machine
Comment by Farshid Ghods (Inactive) [ 05/Nov/11 ]
this is a bug related to our packaging

need to investigate the rpm packaging as well
Comment by Jacob Lundberg [ 05/May/14 ]

This still affects the latest version of Moxi and just deleted our configuration when we upgraded. This is *extremely* annoying and *extremely* easy to fix. Please, please reconsider the resolution of WONTFIX. We are a paying Couchbase customer, for whatever that is worth, and we do not ask for much but we would like to see this happen.

From the root of the debian/ubuntu package, you can resolve this issue with one command (and then rebuild the package):

echo -e "/opt/moxi/etc/moxi.cfg\n/opt/moxi/etc/moxi-cluster.cfg\n/opt/moxi/etc/moxi-init.d" >> debian/conffiles

Likewise in RPM you can add to the %files section of the spec file so it looks like this:

%files
# ... other files ...
%config(noreplace) /opt/moxi/etc/moxi.cfg
%config(noreplace) /opt/moxi/etc/moxi-cluster.cfg
%config /opt/moxi/etc/moxi-init.d
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical"; this needs to be fixed by 3.0 RC.




[MB-11848] CLI is using deprecated /pools/default/rebalanceProgress REST endpoint Created: 29/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Steve Yen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
From tracking down MB-10260, it looks like the /pools/default/rebalanceProgress REST endpoint is deprecated. Need to track down the replacement and have couchbase-cli start using it for the rebalance and rebalance-status commands.




[MB-11638] Outbound mutations not correct when XDCR paused Created: 03/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When XDCR is paused and a workload is ongoing, the "outbound XDCR mutations" value seems to max out at a lower value than there actually are mutations (a few thousand per node?).

Users will want to see how many mutations are outstanding while XDCR is paused to know how much will have to be replicated once they resume.

 Comments   
Comment by Aleksey Kondratenko [ 03/Jul/14 ]
There's _no_ way I can do it.

I'm aware of this. Yes, unlike 2.x, we don't update our "items remaining" stats regularly. 2.x wasted some cycles on that on _every_ vbucket update. But at least it had a reasonably efficient way of getting that count from couch's count reduction.

In 3.0 we simply (at least as of now) have no way of getting that stat anywhere near efficiently.

So the current implementation simply does it:

a) once on wakeup

b) by taking the difference between the highest seqno and the replicated seqno (which is a mere estimate already)

My understanding is that there's no way the current upr implementation can do it anywhere close to efficiently.
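
As a rough illustration of estimate (b) above (the structure and field names here are hypothetical, not the actual XDCR code), the per-vbucket backlog is approximated as the gap between the highest seqno and the last replicated seqno, summed over the vbuckets the replicator owns:

{code}
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-vbucket state; not the real ns_server/XDCR structures. */
struct vb_state {
    uint64_t high_seqno;        /* latest mutation sequence number seen */
    uint64_t replicated_seqno;  /* last sequence number shipped via XDCR */
};

static uint64_t estimate_outbound_mutations(const struct vb_state *vbs, size_t nvbs)
{
    uint64_t remaining = 0;
    for (size_t i = 0; i < nvbs; ++i) {
        /* replicated_seqno is itself an estimate, so guard the subtraction */
        if (vbs[i].high_seqno > vbs[i].replicated_seqno) {
            remaining += vbs[i].high_seqno - vbs[i].replicated_seqno;
        }
    }
    return remaining;
}
{code}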
Comment by Perry Krug [ 03/Jul/14 ]
Thanks Alk, very much understood.

My perspective on this is simply from that of an administrator pushing buttons and looking at the effect, and then asking me or support why.

What would you think about a once-per-second/5second/10second process that only kicked in while paused and grabbed the latest sequence number to update this particular stat with? In my mind, it would be similar to waking up and immediately sleeping all the replicators just to get a count.

On the one hand I realize it seems a very hackish/ad-hoc process, but on the other, I'm positive that end-users will notice this and ask the question (even if it's documented and release noted).

Comment by Aleksey Kondratenko [ 03/Jul/14 ]
We could poll for stats like that. But that means possibly nontrivial CPU overhead and I'd like to minimize it.
Comment by Perry Krug [ 03/Jul/14 ]
I certainly agree with reducing CPU overhead.

If it's trivial to do so, adding an unsupported interval config and/or on-off capability to this would help the field work around any problems. Given that it would only take effect when XDCR is paused, we're going to have a decent amount of "free" CPU in comparison to when XDCR was running previously.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th





[MB-11835] Stack-corruption crash opening db, if path len > 250 bytes Created: 28/Jul/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jens Alfke Assignee: Jung-Sang Ahn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
If the path to a database file is more than 250 bytes long, and auto-compaction is enabled, the stack will be corrupted, probably causing a crash. The reason is that compactor_is_valid_mode() copies the path into a stack buffer 256 bytes long and then appends a 5-byte suffix to it.

The buffer needs to be bigger. I suggest using MAXPATHLEN as the size, at least on Unix systems; it's a common Unix constant defined in <sys/param.h>. On Apple platforms the value is 1024.
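
To make the failure mode concrete, here is a minimal sketch of the pattern and the suggested fix (the suffix and helper names are stand-ins, not the actual forestdb code): with a fixed 256-byte buffer, any filename longer than 256 minus the suffix length overruns the stack, whereas a MAXPATHLEN-sized buffer combined with a bounded copy does not.

{code}
#include <sys/param.h>   /* MAXPATHLEN; 1024 on Apple platforms */
#include <stdio.h>

/*
 * Hypothetical helper illustrating the fix: build "<filename><suffix>" into a
 * caller-provided buffer without ever writing past its end. Calling it with
 * char buf[MAXPATHLEN] instead of char buf[256] leaves room for long paths
 * such as the 251-byte iOS simulator path quoted below.
 */
static int build_suffixed_name(const char *filename, const char *suffix,
                               char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s%s", filename, suffix);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;   /* -1: would truncate */
}
{code}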

Backtrace of the crash in the iOS simulator looks like this; apparently __assert_rtn is an OS stack sanity check.

* thread #1: tid = 0x6112d, 0x0269969e libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0269969e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x0265e2c1 libsystem_pthread.dylib`pthread_kill + 101
    frame #2: 0x023a59c9 libsystem_sim_c.dylib`abort + 127
    frame #3: 0x0237053b libsystem_sim_c.dylib`__assert_rtn + 284
  * frame #4: 0x000b3644 HeadlessBee`compactor_is_valid_mode(filename=<unavailable>, config=<unavailable>) + 276 at compactor.cc:774
    frame #5: 0x000bafd9 HeadlessBee`_fdb_open(handle=<unavailable>, filename=<unavailable>, config=0xbfff8ee0) + 201 at forestdb.cc:842
    frame #6: 0x000baede HeadlessBee`fdb_open(ptr_handle=<unavailable>, filename=<unavailable>, fconfig=0xbfff9010) + 158 at forestdb.cc:528

The actual path causing the crash (251 bytes long) was:

/Volumes/Retina2/Users/snej/Library/Developer/CoreSimulator/Devices/F889372A-F7E8-4534-B6B3-C3E23EFE528C/data/Applications/988D316C-31F3-4A05-8EDC-79C86061C7C9/Library/Application Support/CouchbaseLite/test13_itunesindex-db.cblite2/x:artists.viewindex

 Comments   
Comment by Chiyoung Seo [ 29/Jul/14 ]
http://review.couchbase.org/#/c/39993/




[MB-11383] warmup_min_items_threshold setting is not honored correctly in 3.0 warmup. Created: 10/Jun/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Steps to reproduce:

1)In 3.0 node, create default bucket and load 10,000 items using cbworkloadgen.
2)Run below at command line:
curl -XPOST -u Administrator:password -d 'ns_bucket:update_bucket_props("default", [{extra_config_string, "warmup_min_items_threshold=1"}]).' http://127.0.0.1:8091/diag/eval
3)Restart node for setting to take effect. Restart again for warmup with setting.
4)Issue, ./cbstats localhost:11210 raw warmup:
ep_warmup_estimated_key_count: 10000
ep_warmup_value_count: 1115
5)If I repeat above steps on 2.5.1 node I get:
ep_warmup_estimated_key_count: 10000
ep_warmup_value_count: 101
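
For context, the 2.5.1 numbers are consistent with the threshold being interpreted as a percentage of the estimated key count (1% of 10,000 is about 100 values), while the 3.0 run loads roughly ten times more before stopping. A minimal sketch of that kind of check, with hypothetical names:

{code}
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: stop loading values into memory once the configured
 * percentage of the estimated key count has been warmed up. With
 * threshold_pct = 1 and estimated_keys = 10000, loading should stop at
 * about 100 values, which matches the 2.5.1 output above but not the 3.0
 * output.
 */
static bool warmup_threshold_reached(uint64_t values_loaded,
                                     uint64_t estimated_keys,
                                     uint64_t threshold_pct)
{
    return values_loaded * 100 >= estimated_keys * threshold_pct;
}
{code}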

 Comments   
Comment by Abhinav Dangeti [ 08/Jul/14 ]
Likely because of parallelization. Could you tell me the time taken for warmup for the same scenario in 2.5.1 and 3.0.0?




[MB-11689] [cache metadata]: No indication of what percentage of metadata is in RAM Created: 11/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: David Liao
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-07-11 at 11.32.16.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
For the new Cache Metadata feature we don't appear to give the user any indication of how much metadata has been flushed out to disk.

See attached screenshot - while we do show the absolute amount of RAM used for metadata, there doesn't seem to be any indication of how much of the total is still in RAM.

Note: I had a brief look at the available stats (https://github.com/membase/ep-engine/blob/master/docs/stats.org) and couldn't see a stat for total metadata size (including what has been flushed to disk), so this may also need ep-engine changes if there isn't an underlying stat for it.
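
For clarity, the figure being asked for is simply the resident fraction of metadata, which only becomes computable once a total-metadata-size stat exists; a tiny sketch with hypothetical inputs:

{code}
#include <stdint.h>

/*
 * Hypothetical calculation: percentage of metadata currently resident in RAM.
 * "metadata_in_ram" would come from the existing metadata memory stat;
 * "total_metadata" is the stat the description notes is missing (metadata in
 * RAM plus metadata ejected to disk).
 */
static double metadata_resident_pct(uint64_t metadata_in_ram, uint64_t total_metadata)
{
    if (total_metadata == 0) {
        return 100.0;   /* nothing ejected, everything "resident" */
    }
    return 100.0 * (double)metadata_in_ram / (double)total_metadata;
}
{code}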

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

David - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.




[MB-11804] [Windows] Memcached error #132 'Internal error': Internal error for vbucket... when set key to bucket Created: 23/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: Zip Archive 172.23.107.124-7232014-1631-diag.zip     Zip Archive 172.23.107.125-7232014-1633-diag.zip     Zip Archive 172.23.107.126-7232014-1634-diag.zip     Zip Archive 172.23.107.127-7232014-1635-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build from centos build. http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-999-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Ran the warmup test in build 3.0.0-999 on 4 windows 2008 R2 64-bit nodes:
python testrunner.py -i ../../ini/4-w-sanity-new.ini -t warmupcluster.WarmUpClusterTest.test_warmUpCluster,num_of_docs=100

The test failed when it loaded keys into the default bucket. This test passed on both centos 6.4 and ubuntu 12.04 64-bit.





[MB-11781] [Incremental offline xdcr upgrade] 2.0.1-170-rel - 3.0.0-973-rel, replica counts are not correct Created: 22/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Upgrade from 2.0.1-170 - 3.0.0-973

Ubuntu 12.04 TLS

Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/2f193298/10.3.3.218-7212014-740-diag.zip
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/86678a9d/10.3.3.218-diag.txt.gz
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/a272b793/10.3.3.218-7212014-734-couch.tar.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/46622b23/10.3.3.240-7212014-734-couch.tar.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/6da39af1/10.3.3.240-diag.txt.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/702bdaa2/10.3.3.240-7212014-738-diag.zip

[Destination]

10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/ae25e869/10.3.3.225-diag.txt.gz
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/f44d13a3/10.3.3.225-7212014-734-couch.tar.gz
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/f88f7912/10.3.3.225-7212014-743-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/98090b83/10.3.3.239-7212014-734-couch.tar.gz
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/ddb9b54c/10.3.3.239-7212014-741-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/e3ac7b07/10.3.3.239-diag.txt.gz
Is this a Regression?: Unknown

 Description   
[Jenkins]
http://qa.hq.northscale.net/job/ubuntu_x64--36_01--XDCR_upgrade-P1/24/consoleFull

[Test]
/testrunner -i ubuntu_x64--36_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.incremental_offline_upgrade,initial_version=2.0.1-170-rel,sdata=False,bucket_topology=default:1>2;bucket0:1><2,upgrade_seq=src><dest


[Test Steps]
1. Installed Source (2 nodes) and Destination (2 nodes) with 2.0.1-170-rel.
2. Changed XDCR global settings: xdcrFailureRestartInterval=1, xdcrCheckpointInterval=60 on both clusters.
3. Set up remote clusters (bidirectional).

bucket0 <--> bucket0 (Bi-directional) 10.3.3.240 <---> 10.3.3.239
default ---> default (Uni-directional) 10.3.3.240 -----> 10.3.3.239

4. Load 1000 items into each bucket on the Source cluster.
5. Load 1000 items into bucket0 on the Destination cluster.
6. Wait for replication to finish.
7. Offline-upgrade each node one by one to 3.0.0-973, while loading 1000 items into bucket0 and default on the Source cluster.
8. Verify items on each side.

Expected items: bucket0 = 6000 and default = 5000


[2014-07-21 09:46:45,612] - [task:463] INFO - Saw vb_active_curr_items 5000 == 5000 expected on '10.3.3.239:8091''10.3.3.225:8091',default bucket
[2014-07-21 09:46:45,628] - [data_helper:289] INFO - creating direct client 10.3.3.239:11210 default
[2014-07-21 09:46:45,732] - [data_helper:289] INFO - creating direct client 10.3.3.225:11210 default
[2014-07-21 09:46:45,811] - [task:463] INFO - Saw vb_replica_curr_items 5000 == 5000 expected on '10.3.3.239:8091''10.3.3.225:8091',default bucket
[2014-07-21 09:46:50,832] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:46:55,852] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:00,872] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:05,892] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:10,912] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:15,933] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:20,954] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:25,974] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:30,995] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:36,018] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:41,040] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:46,062] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:51,085] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:56,106] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:01,128] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:06,150] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:11,173] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket


 Comments   
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th




[MB-11801] It takes almost 2x more time to rebalance 10 empty buckets Created: 23/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-881

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File reb_empty.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/400/artifact/
Is this a Regression?: Yes

 Description   
Rebalance-in, 3 -> 4, 10 empty buckets

There was only one change:
http://review.couchbase.org/#/c/34501/

 Comments   
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical"; this needs to be fixed by RC.




[MB-11670] Rebuild whole project when header file changes Created: 08/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Volker Mische Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When you change a header file in the view-engine (couchdb project), the whole project should be rebuilt.

Currently, if you change a header file and don't clean up the project, you could end up with run-time errors like a badmatch on the #writer_acc record.

PS: I opened this as an MB bug and not as a CBD as this is valuable information about badmatch errors that should be public.

 Comments   
Comment by Chris Hillery [ 09/Jul/14 ]
This really has nothing to do with build team, and as such it's perfectly appropriate for it to be MB.

I'm assigning it back to Volker for some more information. Can you give me a specific set of actions you can take that demonstrate this not happening? Is it to do with Erlang code, or C++?
Comment by Volker Mische [ 09/Jul/14 ]
Build Couchbase with a make.

Now edit a couchdb Erlang header file. For example edit couchdb/src/couch_set_view/include/couch_set_view.hrl and comment this block out (with leading `%`):

-record(set_view_params, {
    max_partitions = 0 :: non_neg_integer(),
    active_partitions = [] :: [partition_id()],
    passive_partitions = [] :: [partition_id()],
    use_replica_index = false :: boolean()
}).

When you do a "make" again, ns_server will complain about something missing, but couchdb won't as it doesn't rebuild at all.

Chris, I hope this information is good enough, if you need more, let me know.
Comment by Volker Mische [ 17/Jul/14 ]
Anil, please see the full change history of this issue. It was intentionally set to the view-engine component by Ceej.




[MB-10493] moxi-server RPM – do not overwrite configs on install Created: 18/Mar/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: installer, moxi
Affects Version/s: 1.8.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Alexander Petrossian (PAF) Assignee: Steve Yen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: RHEL5

Operating System: Centos 32-bit

 Description   
The prebuilt RPM contains two configuration files, and on upgrade our live configuration files get moved to .rpmsave and new (useless on a live system) files get installed.

That is utterly inconvenient; please consider stopping this practice.
One way to do that: name the configs XXXX.template.cfg or something like that.

 Comments   
Comment by Alexander Petrossian (PAF) [ 18/Mar/14 ]
I mean moxi-server RPM and these configuration files:
/opt/moxi/etc/moxi.cfg
/opt/moxi/etc/moxi-cluster.cfg
Comment by Alexander Petrossian (PAF) [ 18/Mar/14 ]
(forgot to mention that in MB-9786)
Comment by Anil Kumar [ 17/Jun/14 ]
Triage - June 17 2014 Bin, Wayne, Anil, Ashvinder, Tony
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Steve/Bin - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.




[MB-11042] cbstats usage output not clear Created: 05/May/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
This is the output of cbstats without any arguments on 2.5.1

====================
Usage: cbstats [options]

Options:
  -h, --help show this help message and exit
  -a iterate over all buckets (requires admin u/p)
  -b BUCKETNAME the bucket to get stats from (Default: default)
  -p PASSWORD the password for the bucket if one exists
Usage: cbstats host:port all
  or cbstats host:port allocator
......
====================

As a new user it isn't clear to me which arguments are required and which are optional. Optional arguments are usually denoted with brackets []. See: http://courses.cms.caltech.edu/cs11/material/general/usage.html

Also, it's not made clear in this output that the host:port option expects the data port (usually 11210).

 Comments   
Comment by Anil Kumar [ 04/Jun/14 ]
cbstats belongs to ep_engine.

Triage - June 04 2014 Bin, Ashivinder, Venu, Tony, Anil
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Sundar - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.




[MB-9131] ep_engine pollutes log files on bucket creation Created: 16/Sep/13  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.2.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Artem Stemkovski Assignee: David Liao
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
With these messages

Mon Sep 16 16:17:05.462500 PDT 3: (test) Warning: couchstore_open_db failed, name=/Users/artem/Work/couchbase/ns_server/data/n_1/data/test/975.couch.1 option=2 rev=1 error=no such file [none]
Mon Sep 16 16:17:05.462510 PDT 3: (test) Warning: failed to open database file, name=/Users/artem/Work/couchbase/ns_server/data/n_1/data/test/975.couch.1

This happens because during the warmup stage the vbucket files are not created yet and the following method is invoked: CouchKVStore::listPersistedVbuckets

There are a couple of strange things I see in this method:
1. dbFileRevMapPopulated is always false. It can never become true.
2. discoverDbFiles checks which vbucket files actually exist, but the following code ignores this information and blindly runs through all possible ids and tries to open a db even for nonexistent files, which results in bogus error messages in the log.
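
A minimal sketch of the behaviour point 2 argues for (illustration only; the paths and helper names are made up, and the real code is C++ inside CouchKVStore): check for the file's existence, as discoverDbFiles already does, and skip ids that were never persisted instead of attempting the open and logging a warning.

{code}
#include <stdio.h>
#include <unistd.h>

/*
 * Illustration only: iterate over all possible vbucket ids, but skip the ones
 * whose database file was never created rather than attempting to open them
 * and emitting "failed to open database file" warnings.
 */
static void list_persisted_vbuckets(const char *data_dir, int max_vbuckets)
{
    char path[1024];
    for (int vbid = 0; vbid < max_vbuckets; ++vbid) {
        snprintf(path, sizeof(path), "%s/%d.couch.1", data_dir, vbid);
        if (access(path, F_OK) != 0) {
            continue;   /* file does not exist; nothing to warn about */
        }
        /* open the vbucket database here, e.g. couchstore_open_db(path, ...) */
    }
}
{code}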

 Comments   
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Potential candidate for 3.0.1.




[MB-11343] [xdcr] ep engine stats mismatch (ep_num_ops_del_meta_res_fail != ep_num_ops_del_meta) biXdcr, data loaded/deleted on source cluster Created: 06/Jun/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication, storage-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sangharsh Agarwal Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-781
CentOS 5.8

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11343/150648d6/10.1.3.93-652014-947-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11343/c8468ad0/10.1.3.93-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11343/6c4db259/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11343/d2c69cfb/10.1.3.94-652014-950-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11343/14aefe5d/10.1.3.95-diag.txt.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11343/a923cb4c/10.1.3.95-652014-952-diag.zip

[Destination]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11343/aa11fe0c/10.1.3.96-652014-953-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11343/b79849b8/10.1.3.96-diag.txt.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11343/4c02f8da/10.1.3.97-diag.txt.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11343/82ddaf7c/10.1.3.97-652014-956-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11343/4786816e/10.1.3.99-652014-959-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11343/6fd0f64a/10.1.3.99-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11343/a30c5f01/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11343/aadd424f/10.1.2.12-652014-958-diag.zip
Is this a Regression?: Unknown

 Description   
http://qa.hq.northscale.net/job/centos_x64--01_01--uniXDCR_biXDCR-P0/10/consoleFull

[Test]
./testrunner -i centos_x64--01_01--uniXDCR_biXDCR-P0.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True -t xdcr.xdcrMiscTests.XdcrMiscTests.test_verify_mb8825,items=10000,doc-ops=create-delete,upd=80,del=20,replication_type=xmem,GROUP=P1

Test was added to verify MB-8825.


[Test Steps]
1. Set up XMEM mode bidirectional XDCR between Source and Destination.
2. Start replication from Source -> Destination.
3. Load 10000 items on the Source node. (Data is loaded and deleted on the Source node only.)
4. Start deleting 20% of the items when 50% of the items are loaded on the Source. Additionally, start replication Destination -> Source.
5. Verify results: item counts, metadata. --> This step passed.
6. Verify stats:

   ep_num_ops_set_meta (Source) == 0
   ep_num_ops_del_meta (Source) == 0
   ep_num_ops_del_meta_res_fail (Source) == ep_num_ops_del_meta (Dest) ----> Failed here.
   ep_num_ops_set_meta_res_fail (Source) > 0


     
[Failed logs]

======================================================================
FAIL: test_verify_mb8825 (xdcr.xdcrMiscTests.XdcrMiscTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/xdcr/xdcrMiscTests.py", line 69, in test_verify_mb8825
    self.assertEqual(src_stat_ep_num_ops_del_meta_res_fail, dest_stat_ep_num_ops_del_meta, "Number of failed delete [%s] operation occurs at bucket = %s, while expected to %s" % (src_stat_ep_num_ops_del_meta_res_fail, bucket, dest_stat_ep_num_ops_del_meta))
AssertionError: Number of failed delete [661] operation occurs at bucket = default, while expected to 501

----------------------------------------------------------------------
Ran 1 test in 176.722s



 Comments   
Comment by Sangharsh Agarwal [ 06/Jun/14 ]
[Additional information]
Test passes successfully in 2.5.1-1083 build.
Comment by Sangharsh Agarwal [ 06/Jun/14 ]
Xdcr Dev, please investigate from the XDCR point of view whether deletes are handled properly at the Source end in this case. The number of deletes received and failed (by conflict resolution) on the Source side looks higher than expected, so it is suspected that the destination sends more deletes than expected.
Comment by Aleksey Kondratenko [ 06/Jun/14 ]
Both of us were unable to understand precisely what's being tested here and what this means.

Aruna, given you're local, may I ask you to look at this? Figure out what's wrong and convey it to us?
Comment by Aruna Piravi [ 06/Jun/14 ]
This is what the test does -

A --sets------> B ( A---> B uni-xdcr, only sets)
A --deletes--> B ( A receives deletes that it sends to B )
A <------------> B ( B is made to replicate back to A - no external data sets/deletes on B)

Now by very nature of bixdcr, A receives deletes in incoming XDCR ops. However these were the same mutations A sent to B. So A's conflict resolution rejects these deletes.
This test checks if the deletes received by B == deletes that were rejected by A.

So in this case, deletes received by B(ep_num_ops_del_meta) is 501 but the deletes that were rejected by A(ep_num_ops_del_meta_res_fail) is 661. This used to match until 2.5.1.

The question here is - what other delete mutations is source ep-engine rejecting? Or is B wrongly sending A 661 'delWithMeta's while it received only 501?
Comment by Sangharsh Agarwal [ 19/Jun/14 ]
Bug is appearing with latest execution on build 3.0.0-814
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
This is good candidate for first run with new xdcr tracing.
Comment by Aruna Piravi [ 19/Jun/14 ]
I'm unable to reproduce this problem on 3.0.0-840. Sangharsh can you try? My run - https://friendpaste.com/6fEyTjSLxTjXNhCYiRi5PA.

Pls accept http://review.couchbase.org/#/c/38513/ for enabling trace logging on all nodes.
Comment by Sangharsh Agarwal [ 20/Jun/14 ]
I have run the test on build 3.0.0-843 on the same VMs as the Jenkin job, bug is reproduced.

[Test Logs]
https://friendpaste.com/aaOMacVk2NKIeYoSS46P5

[Source]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11343/75b3a585/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11343/7fab3314/10.1.3.96-6202014-341-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11343/4d480f2e/10.1.3.97-6202014-341-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11343/9ad97373/10.1.3.97-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11343/5c4a8b79/10.1.3.99-6202014-342-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11343/cfa54697/10.1.3.99-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11343/c3b8699d/10.1.2.12-6202014-343-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11343/cb001b4e/10.1.2.12-diag.txt.gz


[Destination]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11343/2eb47317/10.1.3.93-diag.txt.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11343/9b410961/10.1.3.93-6202014-338-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11343/c4b78d6d/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11343/ff7a85d3/10.1.3.94-6202014-339-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11343/0a2c59ef/10.1.3.95-6202014-340-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11343/835646ab/10.1.3.95-diag.txt.gz


It's strange that the bug does not occur on the other VMs while appearing in every execution on these VMs.
But it is also noticed that the bug does not appear with build 2.5.1-1083 on this same set of VMs.
Comment by Sangharsh Agarwal [ 24/Jun/14 ]
Alk, Please check if above logs have required traces that you have merged.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Yes I see traces.
Comment by Aruna Piravi [ 25/Jun/14 ]
Alk, is there a problem with the traces? I do see xdcr_trace.log containing json data. Does QE have an action item on this?
Comment by Aleksey Kondratenko [ 26/Jun/14 ]
No issues with traces.

I was asked if the logs contain traces and I said yes. The comment above mentions that the issue is not occurring anymore, so I assumed that there's nothing I need to do here.
Comment by Aruna Piravi [ 26/Jun/14 ]
:) I think Sangharsh is saying he is able to consistently reproduce it on his environment and that you may not be able to, in yours. In any case, he has attached logs belonging to his environment. Can you pls check the logs?
Comment by Sangharsh Agarwal [ 08/Jul/14 ]
Just to update: the bug is consistently occurring with the CentOS job in 3.0.0-918; it did not occur with the Ubuntu XDCR job.
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
I can confirm from the xdcr trace logs that all mutations sent back from destination to source were rejected.

So it looks like it's merely a stats bug.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical"; this needs to be fixed by RC.




[MB-8730] change moxi configuration receipt from HTTP streaming to configuraion over memcached protocol Created: 30/Jul/13  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.2.0
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Critical
Reporter: Matt Ingenthron Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: cccp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
To support a better, faster, more reliable response to configuration changes, moxi should change from HTTP streaming configuration to the new memcached binary protocol method delivered under project CCCP.

Note this is expected to come at a later time than the initial delivery of CCCP.

This is part of project CCCP, as covered at http://www.couchbase.com/wiki/display/couchbase/Cluster+Configuration+Carrier+Publication

 Comments   
Comment by Anil Kumar [ 17/Oct/13 ]
MB-8417
Comment by Anil Kumar [ 20/Mar/14 ]
Steve - can you please update this ticket.
Comment by Matt Ingenthron [ 20/Mar/14 ]
Anil may have more up to date info, but just one comment from the 'peanut gallery' here, I don't think we actually need to solve this with what I know of field issues and current deployments. Moxi has one per instance and this interface isn't going away.
Comment by Steve Yen [ 29/Jul/14 ]
(scrubbing through old bugs)

Will be pretty major surgery inside moxi to get the CCCP response value from its various worker threads to the relevant reconfiguration thread and navigate back up through the retry logic.

Or, perhaps there might be a simpler way in which each worker thread makes its own independent progress, but needs a lot of thought on a (putting it kindly) "stabilized" codebase.

At root, though, agreeing with Matt on not needing to solve this given current deployments, so marking this as Won't Fix.

(Also, I think the reference to MB-8417 might have been a typo?)




[MB-11322] Documentation :: MB-10086: added api docs for clusterwide collectinfo info feature Created: 04/Jun/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Critical
Reporter: Parag Agarwal Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Hi Ruth, there are new changes to the REST API. Opening this bug for documentation.

MB-10086: added api docs for clusterwide collectinfo info feature

The changes are still being reviewed. You can follow up on the documentation once they are complete.




[MB-10722] ep-engine gerrit jobs don't check out the latest change Created: 01/Apr/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Mike Wiederhold Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
I've had my changes marked verified and then merged them a few times now, and noticed that make simple-test doesn't pass when I run it. It appears that the actual change we want to test is not getting pulled into the test.

 Comments   
Comment by Thuan Nguyen [ 01/Apr/14 ]
It happens in testrunner too
Comment by Phil Labee [ 01/Apr/14 ]
Need more info. Please provide an example of a code review that passed the commit validation test, but failed in your testing after the change was submitted.
Comment by Mike Wiederhold [ 01/Apr/14 ]
http://factory.couchbase.com/job/ep-engine-gerrit-300/415/

http://review.couchbase.org/#/c/35035/
Comment by Maria McDuff (Inactive) [ 04/Apr/14 ]
Tommie,

Can you take a look? Looks like we may need to adjust the testrunner logic.
Please advise.
Comment by Mike Wiederhold [ 29/Jul/14 ]
We no longer have this automation job.




[MB-11009] Discussion:- Should swappiness be set to 1 and not 0 Created: 29/Apr/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.1
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Patrick Varley Assignee: Patrick Varley
Resolution: Unresolved Votes: 0
Labels: customer, swap
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Sorry if this is not the correct forum to have this discussion, but it does affect customers.

Alk, I'm not sure if you are the right person to assign this to, but I feel you have great knowledge in this area and have seen a number of customer cases.

We currently recommend that customers have swap space and that they set vm.swappiness to zero. I believe the idea behind this is to never use swap unless you really have to, i.e. physical memory is full.

Unfortunately, the Linux kernel has changed the behaviour of vm.swappiness in newer kernels:

http://www.mysqlperformanceblog.com/2014/04/28/oom-relation-vm-swappiness0-new-kernel/

In light of this, should we change our recommendation to 1?

Furthermore, do we test our recommendation in extreme cases such as running out of memory? (More a question for test.)

 Comments   
Comment by Steve Yen [ 30/Apr/14 ]
From CBSE review mtg...

This one isn't a high-priority CBSE. The plan is to convert it to a regular MB-level issue, get the recommendation, and update our documentation in a non-alert fashion.
Comment by Aleksey Kondratenko [ 01/May/14 ]
I don't think we have enough evidence to suggest 1. 1 used to be significantly less "enough" than 0. That blog post (which I've seen too; see my recent G+ post) might be a bit premature. It's possible they merely hit some kernel bug which Red Hat will fix shortly.
Comment by Patrick Varley [ 03/May/14 ]
We have seen a number of cases where OOM kicks in and swap is never used.

For users that are on a kernel affected by this, do you think it is a valid workaround until/if Red Hat fixes it?
Comment by Aleksey Kondratenko [ 03/May/14 ]
May I have actual evidence for "OOM kicks in and swap is never used"? Without evidence I won't believe it.
Comment by Aleksey Kondratenko [ 05/May/14 ]
You haven't pasted all of it. But looking at the logs I do see that indeed that's pretty much "it". There are 3 gigs of swap free, and the memcached process is allocating just a single page (gfp-order=0), yet it's hitting OOM.

Still I'm not sure we should just go ahead and tell everyone to update to 1 because:

* it actually does look like a bug, and somebody should (if they haven't already) report it to Red Hat. If someone wants no swap, then only disabling swap completely should do it. Swappiness set to 0 (as recommended by lots of vendors) must not disable it completely.

* for earlier releases, 1 versus 0 is actually twice the swappiness (the code does swappiness + 1, so 0 becomes 1 and 1 becomes 2), and we don't want that

What we can do is broadcast that there's an apparent bug in certain kernel versions, so that folks on those kernel versions can update swappiness to 1.
Comment by Patrick Varley [ 15/May/14 ]
One of our customers now has a cron job that runs every minute, checks how much free memory is left on the system, and changes swappiness to 1 if it is below 2 GB. This has saved a number of nodes.
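For illustration only, here is a minimal Python sketch (not the customer's actual script) of what such a cron-driven check might look like; the 2 GB threshold and the target value of 1 come from the description above, while reading MemFree from /proc/meminfo and everything else is an assumption:

    #!/usr/bin/env python
    # Sketch: if free memory drops below 2 GB, raise vm.swappiness from 0 to 1
    # so the kernel is still allowed to swap instead of invoking the OOM killer.

    THRESHOLD_KB = 2 * 1024 * 1024  # 2 GB, in kB as /proc/meminfo reports

    def free_memory_kb():
        # Read the MemFree line from /proc/meminfo (values are in kB).
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemFree:'):
                    return int(line.split()[1])
        return 0

    if __name__ == '__main__':
        if free_memory_kb() < THRESHOLD_KB:
            # Requires root; equivalent to `sysctl vm.swappiness=1`.
            with open('/proc/sys/vm/swappiness', 'w') as f:
                f.write('1\n')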
Comment by Steve Yen [ 29/Jul/14 ]
Hi Patrick,
From more reading of the Internet, it appears the Internet hasn't reached a definitive new conclusion on the swappiness level other than that MySQL blog article.

Wondering if you or the team have heard of any additional production/field issues or escalations in the intervening months that would provide more data points and lead us to greater confidence in changing to 1?

Thanks,
Steve




[MB-11493] Moxi Auth Timeout reported as SERVER_ERROR proxy write to downstream Created: 20/Jun/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.5.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Minor
Reporter: James Mauss Assignee: Steve Yen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When there is a delay between Moxi and the cluster, the message reported in the Moxi logs for the timeout on 11210 is "SERVER_ERROR proxy write to downstream"; it should be clearer and state that the error was a timeout.

 Comments   
Comment by Steve Yen [ 29/Jul/14 ]
From my reading of the code, a note for me...

One way to detect auth timeout is to do a "stats proxy" and see if the "tot_auth_timeout" counter is increasing greatly.
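As a side note, here is a small sketch (not part of moxi or any shipped tooling) of how that counter could be polled over the memcached ASCII protocol; only the "stats proxy" command and the tot_auth_timeout stat name come from the comment above, and the host, port, and key matching are assumptions:

    import socket

    def proxy_stats(host='127.0.0.1', port=11211):
        # Send the ASCII "stats proxy" command to moxi and collect the
        # "STAT <name> <value>" lines until the terminating END line.
        s = socket.create_connection((host, port), timeout=5)
        try:
            s.sendall(b'stats proxy\r\n')
            data = b''
            while b'END\r\n' not in data:
                chunk = s.recv(4096)
                if not chunk:
                    break
                data += chunk
        finally:
            s.close()
        stats = {}
        for line in data.decode('ascii', 'replace').splitlines():
            parts = line.split()
            if len(parts) == 3 and parts[0] == 'STAT':
                stats[parts[1]] = parts[2]
        return stats

    def tot_auth_timeouts(stats):
        # Stat keys may carry a proxy/bucket prefix, so match by suffix.
        return sum(int(v) for k, v in stats.items() if k.endswith('tot_auth_timeout'))

    before = tot_auth_timeouts(proxy_stats())
    # ... reproduce the slow link / wait a while ...
    after = tot_auth_timeouts(proxy_stats())
    print('tot_auth_timeout grew by %d' % (after - before))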




[MB-10831] Add new stats to Admin UI 'memory fragmentation outside mem_used' Created: 10/Apr/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, storage-engine
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Steve Yen Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate

 Comments   
Comment by Perry Krug [ 10/Apr/14 ]
I think it might be worthwhile to have an absolute number for the amount of memory above and beyond mem_used, as opposed to a percentage... it may let users be a bit more intelligent about when they need to take action.
Comment by Anil Kumar [ 07/Jul/14 ]
Artem – As discussed, it will be useful to have this stat as an absolute value rather than a percentage (%).

Text for stat – “Fragmented data measured outside of mem_used” in GB
Comment by Artem Stemkovski [ 22/Jul/14 ]
What's the best way to get memory fragmentation out of ep_engine?
I see the stat called total_fragmentation_bytes. Is this the correct stat?

Chiyoung Seo:
From the ep-engine stats,
total_free_bytes (free and mapped pages in the allocator) + total_fragmentation_bytes can be used for this.
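In other words, the number to surface would just be the sum of those two ep-engine stats. A trivial sketch of that arithmetic, assuming the stats have already been fetched into a dict of strings (how they are fetched is outside the scope of this note):

    def fragmentation_outside_mem_used(stats):
        # Free-but-mapped allocator pages plus fragmentation bytes, per the note above.
        return int(stats['total_free_bytes']) + int(stats['total_fragmentation_bytes'])

    # Example with made-up byte values:
    sample = {'total_free_bytes': '104857600', 'total_fragmentation_bytes': '52428800'}
    print(fragmentation_outside_mem_used(sample))  # 157286400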
Comment by Artem Stemkovski [ 22/Jul/14 ]
Apparently there's no way to get tcmalloc stats out of ep_engine if there are no buckets.
We still need to display system stats even if there are no buckets configured.

So we need a way to query tcmalloc stats globally without logging in to a particular bucket.
Rerouting the ticket to ep_engine.




[MB-5212] malformed request crashes Moxi Created: 02/May/12  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 1.8.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Perry Krug Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
Description from Customer:
We recently found and fixed a thread safety bug in our client code that was causing a few issues, and I suspect that it may have been causing our Moxi problems as well. Since the fix for this bug was deployed last Thursday evening, we've not had a single Moxi process crash. Previously we were seeing at least one Moxi instance crash (at least) every couple of days, and sometimes multiple crashes per day. So six consecutive days of uptime (and counting) is a significant improvement on what we were seeing before.
 
The problem was with two threads sometimes attempting to use the same libmemcached handle at the same time. Since libmemcached handles are not thread-safe, I suspect that libmemcached could have been sending malformed requests, which may in turn have caused the Moxi problems we were seeing.
 
We'll deploy the new version of Moxi anyway, in order to make sure we have all the latest fixes, but I thought you might be interested to know that Moxi seems more stable now that our client-side issue has been fixed. It might also be worth your engineering team spending some time testing whether there are ways in which malformed memcached protocol requests can cause crashes in Moxi.



 Comments   
Comment by Steve Yen [ 29/Jul/14 ]
We're likely not going to spend time hardening moxi against malformed requests.




[MB-5464] moxi assertion fail Created: 06/Jun/12  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 1.8.1, 2.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Jin Lim Assignee: Steve Yen
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows Server (Service Pack 2)

Triage: Untriaged

 Description   
Moxi.exe runs into the assertion described below during the Couchbase Server startup. This causes memcached to restart a couple of times (2 - 3) before completing its warmup.

Assertion fail message from Couchbase Server log:

Port server memcached on node 'ns_1@10.3.121.142' exited with status 255. Restarting. Messages: Assertion failed!

Program: c:\opt\couchbase\bin\moxi.exe
File: cproxy.c, Line 157

Expression: port > 0 || settings.socketpath != NULL

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

Memcached.exe command line:
c:\opt\couchbase\bin\memcached.exe -X c:\opt\couchbase\lib\memcached\stdin_term_handler.so -l 0.0.0.0:11210,0.0.0.0:11209:1000 -E c:/opt/couchbase/lib/memcached/bucket_engine.so -B binary -r -c 10000 -e admin=_admin;default_bucket_name=default;auto_create=fal

 Comments   
Comment by Jin Lim [ 06/Jun/12 ]
This appears to be a Windows platform specific issue.
Comment by Slawek [ 12/Sep/12 ]
This is happening to me now every day on two servers Win server 2003 & 2008 with Couchbase 1.8.0 & 1.8.1

I don't see any other errors, so I'm not sure what's causing it. After moxi restarts caching is not working until I restart applications in IIS.

############################################################
Port server moxi on node 'ns_1@192.168.243.12' exited with status 3. Restarting. Messages: 2012-09-10 17:13:21: (cproxy_config.c.317) env: MOXI_SASL_PLAIN_USR (7)
2012-09-10 17:13:21: (cproxy_config.c.326) env: MOXI_SASL_PLAIN_PWD (9)
Assertion failed!

Program: d:\Program Files\Couchbase\Server\bin\moxi.exe
File: cproxy_protocol_b2b.c, Line 56

Expression: uc->noreply == false

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
############################################################
Comment by Steve Yen [ 29/Jul/14 ]
scrubbing ancient moxi bugs...

One way this might happen (uc->noreply == true) is if you're using a quiet binary command and somehow the logic in moxi got bungled. One potential candidate...

memcached.c
  void process_bin_noreply(conn *c) {
    assert(c);
    c->noreply = true;
    switch (c->binary_header.request.opcode) {
    case PROTOCOL_BINARY_CMD_SETQ:
        c->cmd = PROTOCOL_BINARY_CMD_SET;
        break;
    ... more cases here ...
    default:
        c->noreply = false;
    }
  }

No one else has reported any issues around this, though, so perhaps this (admittedly, probably too optimistically) was taken care of via other fixes over the years, especially with the latest 3.0 rework.




[MB-11830] {UPR}:: CBTransfer fails {error: could not read _local/vbstate} after Rebalance-in Created: 27/Jul/14  Updated: 29/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket, tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.6.2.144-10.6.2.147

Triage: Triaged
Is this a Regression?: Unknown

 Description   
Build 1035, CentOS 6.x

1. Create 3 Node cluster
2. Add a default bucket
3. Load 100 K items
4. Rebalance-in 1 node

Rebalance succeeds, and after all the queues have drained, we compare the replica items loaded initially vs. those present after the rebalance-in.

The test fails since replica items are missing.

Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalancein.RebalanceInTests.rebalance_in_after_ops,nodes_in=1,nodes_init=3,replicas=1,items=100000,GROUP=IN;P0

This test case passed for 973 build

https://s3.amazonaws.com/bugdb/jira/MB-11830/replica_keys_missing.txt

https://s3.amazonaws.com/bugdb/jira/MB-11830/1035_log_data_log.tar.gz


 Comments   
Comment by Mike Wiederhold [ 28/Jul/14 ]
I checked two of the vbuckets that supposedly had missing keys.

mike12109 -> vbucket 640
mike12108 -> vbucket 391

Mike-Wiederholds-MacBook-Pro:1035_log_data_log.tar mikewied$ cat 10.6.2.145/stats.log | grep '_391:num_items'
 vb_391:num_items_for_persistence: 0
 vb_391:num_items: 103
Mike-Wiederholds-MacBook-Pro:1035_log_data_log.tar mikewied$ cat 10.6.2.144/stats.log | grep '_391:num_items'
 vb_391:num_items_for_persistence: 0
 vb_391:num_items: 103


In both cases there doesn't seem to be any data loss. Please include vbuckets that have mismatched items, if there are any. It's also not clear how you found these missing keys.
Comment by Parag Agarwal [ 28/Jul/14 ]
We take snapshots via cbtransfer before and after and then compare. If you run the above-mentioned test case using cluster_run, it should reproduce.
Comment by Mike Wiederhold [ 28/Jul/14 ]
Parag,

If you're getting the keys with cbtransfer then the bug might be there. Is there any stats verification done before you run cbtransfer?
Comment by Parag Agarwal [ 28/Jul/14 ]
Mike

The test reads the couchstore files via cbtransfer. We can check with Bin whether any changes were made.

We do check stats such as the expected active and replica item counts and that the queues are drained. After these checks we do more detailed data validation.

Bin: Were there any changes made that might impact cbtransfer? We are using build 1035 and seeing inconsistency in the replica items when comparing before and after the rebalance-in operation.
Comment by Mike Wiederhold [ 28/Jul/14 ]
In order to debug this further on the server side I would need the data files. If Bin doesn't find anything from looking at the cbtransfer script then please re-run this test and either let me look at the live cluster or attach the data files along with the logs.
Comment by Bin Cui [ 28/Jul/14 ]
I don't think it is something related to cbtransfer. If you can get the right replica data before the rebalance and miss some of it afterwards, then either ep_engine didn't provide those missing items, or we missed some items due to a seqno change.

1. What if you run a full backup after the rebalance? Do you still have missing items?
2. If this is solely based on incremental backup, maybe the missing data is caused by wrong seqnos or failover logs? I am not sure.
Comment by Mike Wiederhold [ 28/Jul/14 ]
I've looked at the data files and there doesn't appear to be any sign of data loss. Even the keys that were reported missing can be found in the data files. We will keep investigating to see why this test is failing, but there doesn't appear to be any data loss on the server at the moment.
Comment by Parag Agarwal [ 28/Jul/14 ]
Bin:: We are hitting the following issue after rebalance

For Replica

Last login: Mon Jul 28 13:31:33 2014 from 10.17.45.173
[root@palm-10307 ~]# /opt/couchbase/bin/cbtransfer couchstore-files:///opt/couchbase/var/lib/couchbase/data csv:/tmp/ab45472c-1695-11e4-9a2f-005056970042.csv -b default -u Administrator -p password --source-vbucket-state=replica
error: could not read _local/vbstate from: /opt/couchbase/var/lib/couchbase/data/default/106.couch.7; exception: Expecting object: line 1 column 81 (char 81)
[root@palm-10307 ~]#

Can you please check, the cluster is live: 10.6.2.144

I was able to repro it
Comment by Parag Agarwal [ 28/Jul/14 ]
Occurs for Active

2014-07-28 13:28:47 | INFO | MainProcess | test_thread | [remote_util.execute_command_raw] running command.raw on 10.6.2.145: /opt/couchbase/bin/cbtransfer couchstore-files:///opt/couchbase/var/lib/couchbase/data csv:/tmp/c71f43ee-1695-11e4-9a2f-005056970042.csv -b default -u Administrator -p password
2014-07-28 13:28:47 | INFO | MainProcess | test_thread | [remote_util.execute_command_raw] command executed successfully
2014-07-28 13:28:47 | INFO | MainProcess | test_thread | [remote_util.log_command_output] error: could not read _local/vbstate from: /opt/couchbase/var/lib/couchbase/data/default/127.couch.8; exception: Expecting object: line 1 column 81 (char 81)
Comment by Bin Cui [ 28/Jul/14 ]
It happens when the tool tries to read the couchstore files using the couchstore API. The code snippet is as follows:

                store = couchstore.CouchStore(f, 'r')
                try:
                    doc_str = store.localDocs['_local/vbstate']
                    if doc_str:
                        doc = json.loads(doc_str)
                        state = doc.get('state', None)
                        if state:
                            vbucket_states[state][vbucket_id] = doc
                        else:
                            return "error: missing vbucket_state from: %s" \
                                % (f), None
                except Exception, e:
                    return ("error: could not read _local/vbstate from: %s" +
                            "; exception: %s") % (f, e), None

We need to figure out why the couchstore API raises an exception in this case.
Comment by Bin Cui [ 28/Jul/14 ]
Parag, can you attach /opt/couchbase/var/lib/couchbase/data/default/106.couch.7 so we can figure out why the exception is thrown?
Comment by Mike Wiederhold [ 28/Jul/14 ]
Sundar,

It's because the failover log is empty. This is likely a regression from the set vbucket state change you made where you cache the failover log.

[root@palm-10307 ~]# cat /opt/couchbase/var/lib/couchbase/data/default/106.couch.7
?)+k
    ?cr?fT?R_local/vbstate{"\": "dead","checkpoint_id\0","max_deleted_seqno": Dfailover_table": }???)????

                                                                                                         "k[root@palm-10307 ~]#
[root@palm-10307 ~]#
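For context on the error the test reports: when the failover_table key has no value, as in the truncated vbstate document shown above, the Python 2 json module used by the tool fails with exactly that "Expecting object" message. A small illustrative snippet (the string below is a simplified stand-in, not the real on-disk document):

    import json

    # Simplified stand-in for a _local/vbstate doc whose failover_table value is missing.
    bad_vbstate = '{"state": "dead", "checkpoint_id": "0", "failover_table": }'

    try:
        json.loads(bad_vbstate)
    except ValueError as e:
        # Under Python 2.7 this prints something like:
        #   Expecting object: line 1 column ... (char ...)
        print(e)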
Comment by Sundar Sridharan [ 28/Jul/14 ]
Thanks Mike, looks like the case of a failover table with no entries needs to be handled too.
http://review.couchbase.org/39967
Comment by Sundar Sridharan [ 28/Jul/14 ]
The fix was merged. Parag, can you please verify the change?
Comment by Parag Agarwal [ 29/Jul/14 ]
Reproduced in build 1046.
Comment by Sundar Sridharan [ 29/Jul/14 ]
Parag, can you again attach the couch file where the error occurred or point me to the machine ? thanks
Comment by Sundar Sridharan [ 29/Jul/14 ]
thanks Parag
Comment by Sundar Sridharan [ 29/Jul/14 ]
The following test now passes with cluster_run...
./testrunner -i ~/palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,skip_cleanup=True,blob_generator=false

..
OK
summary so far suite rebalance.rebalance_progress.RebalanceProgressTests , pass 1 , fail 0

If you see this issue again, please kill the test and point me to the cluster so I can grab the vbucket file which caused the error and debug further.
thanks
Comment by Parag Agarwal [ 29/Jul/14 ]
Somehow I mentioned the wrong test case in the description. It was from another bug and got copied incorrectly

The correct test case is as follows


Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalancein.RebalanceInTests.rebalance_in_after_ops,nodes_in=1,nodes_init=3,replicas=1,items=100000,GROUP=IN;P0
Comment by Sundar Sridharan [ 29/Jul/14 ]
Parag, can you please verify that you posted the correct command, because when I run it I get an error like this:
Traceback (most recent call last):
  File "./testrunner.py", line 469, in <module>
    watcher()
  File "./testrunner.py", line 454, in watcher
    main() # child runs test
  File "./testrunner.py", line 336, in main
    suite = unittest.TestLoader().loadTestsFromName(before_suite_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/loader.py", line 91, in loadTestsFromName
    module = __import__('.'.join(parts_copy))
ImportError: No module named rebalance_in_after_op
-bash: P0: command not found
Comment by Parag Agarwal [ 29/Jul/14 ]
I have corrected it

 ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalancein.RebalanceInTests.rebalance_in_after_ops,nodes_in=1,nodes_init=3,replicas=1,items=100000,GROUP=IN;P0
Comment by Sundar Sridharan [ 29/Jul/14 ]
I see the failover table being printed out for the vbuckets correctly:
local/vbstate{"d": "active","checkpoint_id?O19","max_deleted_seqno": "0","failover_table": [{"id":255104770402436,"seq":0}]}
Item counts on the buckets seem consistent at 100K. Active and replica counts are also OK.
It appears the test is throwing a different failure:
======================================================================
FAIL: rebalance_in_after_ops (rebalance.rebalancein.RebalanceInTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/rebalance/rebalancein.py", line 41, in rebalance_in_after_ops
    disk_replica_dataset, disk_active_dataset = self.get_and_compare_active_replica_data_set_all(self.servers[:self.nodes_init], self.buckets, path=None)
  File "pytests/basetestcase.py", line 1112, in get_and_compare_active_replica_data_set_all
    self.assertTrue(logic, summary)
AssertionError:
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Analyzing for Bucket default
1) Failure :: Deleted Items :: Expected False, Actual True
2) Failure :: Added Items :: Expected False, Actual True
3) Success :: Updated Items

-- Since this is test-specific, Parag, could you please help identify the exact issue with the test?
Something like a specific cbtransfer failure message would be useful. Thanks.
Comment by Parag Agarwal [ 29/Jul/14 ]
Live cluster:: http://qa.hq.northscale.net/view/3.0.0/job/centos_x64--02_01--Rebalance-In/79/console

Ran this command

2014-07-29 16:55:40 | INFO | MainProcess | test_thread | [remote_util.execute_command_raw] running command.raw on 10.6.2.144: /opt/couchbase/bin/cbtransfer couchstore-files:///opt/couchbase/var/lib/couchbase/data csv:/tmp/d81eb134-177b-11e4-b2bd-005056970042.csv -b default -u Administrator -p password
2014-07-29 16:55:40 | INFO | MainProcess | test_thread | [remote_util.execute_command_raw] command executed successfully

Error

2014-07-29 16:55:40 | INFO | MainProcess | test_thread | [remote_util.log_command_output] error: could not read _local/vbstate from: /opt/couchbase/var/lib/couchbase/data/default/106.couch.8; exception: Expecting object: line 1 column 81 (char 81)





[MB-10222] Couchbase mac installation generates 0B Couchbase.log files, these files look redundant Created: 14/Feb/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ketaki Gangal Assignee: Bin Cui
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
The current Mac installation creates Couchbase.log files in ~/Library/Logs/ which contain no data:

-rw-r--r-- 1 couchbase staff 0B Feb 13 18:32 Couchbase.log
-rw-r--r-- 1 couchbase staff 0B Feb 12 14:22 Couchbase.log.old

But these files are empty. The actual log files are generated here
~/Library/Application Support/couchbase/var/lib/couchbase/logs.

What logs are expected at ~/Library/Logs? Otherwise these files should be moved out.



 Comments   
Comment by Steve Yen [ 29/Jul/14 ]
Those Couchbase.log files are different from the usual server-side log files that we're familiar with.

The Couchbase.log files are supposedly written to when our Mac OSX launcher widget (the top-right launcher icon with the little couch picture on the OSX toolbar) is doing stuff...

https://github.com/couchbase/couchdbx-app/blob/master/Couchbase%20Server/Couchbase_ServerAppDelegate.m#L146




[MB-10927] couchbase-cli throws out incorrect output when actually command executed successfully in --cluster-init Created: 22/Apr/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Triaged
Is this a Regression?: Unknown

 Description   
Installed Couchbase Server 3.0.0-594 on Ubuntu 12.04 64-bit.
In the CLI test, the --cluster-init option test failed due to an incorrect output printout.
The command executed successfully, but the output says "ERROR: command: cluster-init: 192.168.171.151:8091, [Errno 111] Connection refused".
I can tell the command actually executed and changed the Couchbase Server configuration because the next command, using the new credentials and port, also executed successfully,
as in the following run:


root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
ERROR: command: cluster-init: 192.168.171.151:8091, [Errno 111] Connection refused
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
SUCCESS: init localhost
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost:8099 -u Administrator1 -p password1 --cluster-username=Administrator --cluster-password=password --cluster-port=8091 --cluster-ramsize=300
ERROR: command: cluster-init: localhost:8099, [Errno 111] Connection refused
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli cluster-init -c 192.168.171.151:8091 --cluster-username=Administrator1 --cluster-password=password1 --cluster-port=8099 -u Administrator -p password --cluster-ramsize=300
SUCCESS: init 192.168.171.151


 Comments   
Comment by Bin Cui [ 23/Apr/14 ]
It is up to ns_server to return the connection refused error.
Comment by Wayne Siu [ 13/May/14 ]
Alk,
Why do we get the "connection refused" message? Does it indicate there is another issue?
Comment by Aleksey Kondratenko [ 10/Jun/14 ]
It is expected for cluster nodes to be temporarily unavailable during a change of port, as well as while joining or leaving the cluster.

Also note that changing the port, like changing nearly all of our settings, is async, especially in a distributed setting. There is already a ticket (and planned work) to make waiting for the result of any REST API call possible in a unified fashion.
Comment by Anil Kumar [ 10/Jun/14 ]
Alk - let's add the ticket you mentioned and close this ticket as 'won't fix' for now.

Triage - June 10 2014 Anil
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
Anil's comment above is a bit incorrect. I have created MB-11484 for the generic problem of observing completion of changes.

But for this specific issue, I believe that we have no choice but to have the CLI implement "poll for when the node is ready".
Comment by Aleksey Kondratenko [ 19/Jun/14 ]
And I don't think it needs to be part of 3.0
Comment by Steve Yen [ 29/Jul/14 ]
Spoke with Bin and, based on Alk's comment, it looks like fixing the race here with a new/tweaked protocol will be too big a change at this point in the dev cycle. Pushing to 3.0.1.

(In particular, Bin mentions some qualms about polling from the CLI rather than having the server block, but that's for the post-3.0 timeframe.)
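To make the "poll for when node ready" idea concrete, here is a rough sketch (not the actual couchbase-cli code; the /pools endpoint, helper name, timeouts, and use of the requests library are all assumptions) of how a CLI could wait for the node to answer on its new port before reporting success or failure:

    import time
    import requests  # assumed available; any HTTP client would do

    def wait_for_node_ready(host, port, user, password, timeout_secs=60):
        # Poll the node's REST interface until it responds, or give up.
        deadline = time.time() + timeout_secs
        url = 'http://%s:%s/pools' % (host, port)
        while time.time() < deadline:
            try:
                r = requests.get(url, auth=(user, password), timeout=5)
                if r.status_code == 200:
                    return True
            except requests.RequestException:
                pass  # connection refused while the node re-binds its REST port
            time.sleep(1)
        return False

    # e.g. after cluster-init moves the admin port to 8099:
    # wait_for_node_ready('192.168.171.151', 8099, 'Administrator1', 'password1')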




[MB-11186] Revisit some failed test cases for couchbase-cli, which are commented out now Created: 22/May/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Steve Yen [ 29/Jul/14 ]
Reviewed these with Bin; he says that when run individually, the majority of these work, but when run all together there seem to be interactions that make them fail.

As of 2014/07/29, the commented-out test cases in pump_dcp_test.py include...

    def __test_rejected_auth(self):
    def __test_close_after_auth(self):
    def __test_full_diff(self):
    def __test_full_diff_diff_acc(self):
    def __test_2_mutation_chopped_header(self):
    def __test_delete_ack(self):
    def __test_noop(self):
    def __test_tap_cmd_opaque(self):
    def __test_flush_all(self):
    def __test_restore_1M_blob(self):
    def __test_restore_30M_blob(self):
    def __test_restore_batch_max_bytes(self):
    def __test_immediate_not_my_vbucket_during_restore(self):
    def __test_later_not_my_vbucket_during_restore(self):
    def __test_immediate_not_my_vbucket_during_restore_1T(self):
    def __test_immediate_not_my_vbucket_during_restore_5T(self):
    def __test_immediate_not_my_vbucket_during_restore_5B(self):
    def __test_rejected_auth(self):




[MB-11845] couch_compact: Handle the case when couchstore_open_db_ex() fails Created: 29/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Patrick Varley Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: compaction, customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency

 Description   
If couchstore_open_db_ex fails in couchstore_compact_db_ex, couchstore_close_db would cause a segmentation fault.

 Comments   
Comment by Abhinav Dangeti [ 29/Jul/14 ]
http://review.couchbase.org/#/c/40031/




[MB-11842] [windows] compile error: cannot find platform.h Created: 29/Jul/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Phil Labee Assignee: Trond Norbye
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://factory.hq.couchbase.com:8080/job/cs_300_win6408/

Triage: Untriaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://factory.hq.couchbase.com:8080/job/cs_300_win6408/589/consoleFull
Is this a Regression?: Unknown

 Description   

cb_time.c
C:\Jenkins\workspace\cs_300_win6408\couchbase\platform\src\cb_time.c(26) : fatal error C1083: Cannot open include file: 'platform.h': No such file or directory


The commit validation tests passed, but they run on Linux.

 Comments   
Comment by Trond Norbye [ 29/Jul/14 ]
http://review.couchbase.org/#/c/40020/
Comment by Phil Labee [ 29/Jul/14 ]
3.0.0-1054




[MB-9232] /etc/init.d/couchbase-server stop shows no status of success Created: 08/Oct/13  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: installer
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Perry Krug Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.4


 Description   
[root@cb2 ~]# /etc/init.d/couchbase-server restart
Stopping couchbase-server
Starting couchbase-server [ OK ]


There should be an indication of 'OK' or 'FAILED' for the stopping of Couchbase Server

 Comments   
Comment by Steve Yen [ 09/Jan/14 ]
Hi Bin,
This feels like something where a printf or echo or equivalent could easily be added to one of those Linux startup scripts?
Steve
Comment by Bin Cui [ 29/Jul/14 ]
http://review.couchbase.org/#/c/40016/




[MB-11793] Build breakage in upr-consumer.cc Created: 22/Jul/14  Updated: 29/Jul/14  Due: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: None
Affects Version/s: .master
Fix Version/s: .master
Security Level: Public

Type: Task Priority: Test Blocker
Reporter: Chris Hillery Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: ep-engine
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Commit 8d636bbb02b0338df9e73c2573422b6463feb92d to ep-engine appears to be breaking the build on most platforms, eg:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-master-builder/builds/890/steps/couchbase-server%20make%20enterprise%20/logs/stdio

 Comments   
Comment by Mike Wiederhold [ 23/Jul/14 ]
Just want to note here that this does not affect 3.0 builds in case anyone is looking at the ticket. The merge of the memcached 3.0 branch is linked below. Since I don't think anyone is working on the master branch I'm going to wait for someone to review the change.

http://review.couchbase.org/#/c/39708/




[MB-11713] DCP logging needs to be improved for view engine and xdcr Created: 11/Jul/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
After looking at one of the system tests I found the following number of log messages

[root@soursop-s11205 ~]# cat /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | wc -l
1061224 // Total log messages

[root@soursop-s11205 ~]# cat /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep xdcr | wc -l
1033792 // XDCR related upr log messages

[root@soursop-s11205 ~]# cat /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep -v xdcr | grep UPR | wc -l
3730 // Rebalance related UPR messages

In this case 97% of all log messages are for XDCR UPR streams.
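A quick sanity check of that percentage from the counts above (nothing here beyond the arithmetic):

    xdcr_msgs = 1033792   # XDCR-related UPR log messages
    total_msgs = 1061224  # total log messages
    print('%.1f%%' % (100.0 * xdcr_msgs / total_msgs))  # ~97.4%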

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Mike Wiederhold [ 29/Jul/14 ]
Marking as won't fix for now. The indexer and view engine should change the way they use UPR in a future release, and I cannot think of a better way to reduce the log messages based on their current usage.




[MB-11832] [System Test] Rebalance + Indexing stuck on Rebalance-In in light DGM setup Created: 28/Jul/14  Updated: 29/Jul/14  Resolved: 29/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Ketaki Gangal
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1021-rel

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
1. Create 2 buckets, 1 ddoc,2 Views each
2. Load 120M, 113M items on respective buckets, dgm 70%
3. Wait for initial indexing to complete
4. Rebalance In 1 node - Rebalance is stuck at about 0%

-- Seeing a few error messages on server timeouts.

upr client (default, mapreduce_view: default _design/ddoc1 (prod/main)): Obtaining mutation from server timed out after 60.0 seconds [RequestId 353650, PartId 7, StartSeq 152559, EndSeq 152824]. Waiting...

-- Attaching logs.

 Comments   
Comment by Parag Agarwal [ 28/Jul/14 ]
We had hit another bug with 1035: https://www.couchbase.com/issues/browse/MB-11827, which has indexing stuck during rebalance
Comment by Ketaki Gangal [ 28/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/11832/index.tar

and https://s3.amazonaws.com/bugdb/11832/part2.tar
Comment by Mike Wiederhold [ 28/Jul/14 ]
Ketaki,

Did this rebalance eventually complete? I know you guys only check that rebalance progresses for a minute or so, and if it doesn't then you end the test. After looking at the logs I think this is only a temporary stuck issue. Please confirm.
Comment by Ketaki Gangal [ 29/Jul/14 ]
The rebalance was stuck for over a day. The way I could resolve this was -
1. Stop Rebalance
2. Allow indexing on ddoc2 to complete.
3. Restart rebalance
Comment by Ketaki Gangal [ 29/Jul/14 ]
Same phase/ rebalance-in runs ok on newer build 3.0.0-1037-rel.
Comment by Mike Wiederhold [ 29/Jul/14 ]
This needs to be re-run on the latest build. I fixed a rebalance stuck issue yesterday.
Comment by Mike Wiederhold [ 29/Jul/14 ]
Duplicate of MB-11827. I just looked at the logs and found the same symptoms. Please re-open if this issue is still happening on builds 1040 or higher.




[MB-11827] {UPR} :: Rebalance stuck with rebalance-out due to indexing stuck Created: 26/Jul/14  Updated: 29/Jul/14  Resolved: 28/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.6.2.144-10.6.2.147

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Build 1033, CentOS 6.x

1. Create 4 node cluster
2. Create default bucket
3. Add 1 K items
4. Create 3 views and start querying
5. Rebalance-out 1 node

Step 4 and Step 5 act in parallel

Rebalance is stuck

Looked at the couchdb log and found the following error across different machines in the cluster:

[couchdb:error,2014-07-26T18:47:27.450,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16158.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.557,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16199.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.745,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16230.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.783,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16240.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.831,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16250.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.877,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16260.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

TEST CASE ::
./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalance_progress.RebalanceProgressTests.test_progress_rebalance_out,nodes_init=4,nodes_out=1,GROUP=P0,skip_cleanup=True,blob_generator=false

 Comments   
Comment by Parag Agarwal [ 26/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11827/1033log.tar.gz

Comment by Parag Agarwal [ 26/Jul/14 ]
Test Case failing:: http://qa.hq.northscale.net/view/3.0.0/job/centos_x64--02_05--Rebalance_Progress/

Check the first 6; rebalance hangs.
Comment by Aleksey Kondratenko [ 26/Jul/14 ]
Indeed we're waiting for indexes.
Comment by Sarath Lakshman [ 28/Jul/14 ]
T