[MB-11866] beam.smp RSS suddenly increases to 45GB after unexpected_binary error Created: 01/Aug/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1069

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = 2 x SSD

Attachments: PNG File beam.smp_rss.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/leto/429/artifact/
Is this a Regression?: Yes

 Description   
Mixed KV + queries rebalance test, 3 -> 4 nodes, 1 bucket x 100M x 2KB, 1 x 1 views, 10K ops/sec, 400 qps

After almost 4 hours the rebalance completed, but then beam.smp memory started growing and eventually caused an OOM situation.

[couchdb:error,2014-08-01T0:31:05.986,ns_1@172.23.100.30:<0.30614.322>:couch_log:error:44]Uncaught error in HTTP request: {error,
                                 {badmatch,
                                  {unexpected_binary,
                                   {at,1158389762},
                                   {wanted_bytes,419601992},
                                   {got,2602207,

Quite a typical problem.

 Comments   
Comment by Nimish Gupta [ 01/Aug/14 ]
It looks to me like the index file is corrupted. Pavel, is this easily reproducible?
Comment by Pavel Paulau [ 01/Aug/14 ]
Dunno, it only happened once recently.
Comment by Pavel Paulau [ 01/Aug/14 ]
I'm more concerned about the way you handle errors like this one.
Index corruption should not crash the entire system.




[MB-11857] [System Test] Indexing stuck on initial load Created: 31/Jul/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Ketaki Gangal Assignee: Meenakshi Goel
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Live cluster available on 10.6.2.163:8091

Build : 3.0.0-1059

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
1. Create a 7 node cluster, 2 buckets, 1 ddoc, 2 Views
2. Load 120M, 180M items on the buckets.
3. Wait for indexing to complete.

-- Indexing appears to be stuck on the cluster (over 12 hours)
--- The couchdb logs show a couple of errors from couch_set_view_updater,'-load_changes

[couchdb:error,2014-07-30T14:10:37.494,ns_1@10.6.2.167:<0.17752.67>:couch_log:error:44]Set view `default`, main (prod) group `_design/ddoc1`, received error from updater: {error,
                                                                                     vbucket_stream_already_exists}
[couchdb:error,2014-07-30T14:10:37.499,ns_1@10.6.2.167:<0.4435.98>:couch_log:error:44]Set view `default`, main group `_design/ddoc1`, doc loader error
error: function_clause
stacktrace: [{couch_set_view_updater,'-load_changes/8-fun-0-',
                 [vbucket_stream_not_found,{8,149579}],
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                  {line,461}]},
             {couch_upr_client,receive_events,4,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                  {line,854}]},
             {couch_upr_client,enum_docs_since,8,


[root@centos-64-x64 bin]# ./cbstats localhost:11210 all | grep upr
 ep_upr_conn_buffer_size: 10485760
 ep_upr_enable_flow_control: 1
 ep_upr_enable_noop: 1
 ep_upr_max_unacked_bytes: 524288
 ep_upr_noop_interval: 180

Attaching logs.





 Comments   
Comment by Ketaki Gangal [ 31/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/11857/bug.tar
Comment by Volker Mische [ 31/Jul/14 ]
Sarath already has a test blocker, hence I'm taking this one.
Comment by Volker Mische [ 31/Jul/14 ]
First finding: node 10.6.2.163 OOM-kills things before the view errors occur. beam.smp takes 11 GB of RAM.
Comment by Volker Mische [ 31/Jul/14 ]
I was wrong; the OOM kill seems to have happened before this test was run.

Would it be possible to also get the information at which time the test was started? Sometimes the logs are trimmed, so it's hard to tell.
Comment by Meenakshi Goel [ 31/Jul/14 ]
Test was started at 2014-07-30 08:48
Comment by Volker Mische [ 31/Jul/14 ]
As the system was still running I could inspect it. The DCP client is waiting for a message (probably a close-stream one) but never receives it. It is probably in an endless loop trying to receive it.

I'll add a log message when this is happening.

That's the stack trace where it is stuck:

erlang:process_info(list_to_pid("<0.1985.19>"), current_stacktrace).
{current_stacktrace,[{couch_upr_client,get_stream_event_get_reply,
                                       3,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,201}]},
                     {couch_upr_client,get_stream_event,2,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,196}]},
                     {couch_upr_client,receive_events,4,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,846}]},
                     {couch_upr_client,enum_docs_since,8,
                                       [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                                        {line,248}]},
                     {couch_set_view_updater,'-load_changes/8-fun-2-',12,
                                             [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                                              {line,510}]},
                     {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
                     {couch_set_view_updater,load_changes,8,
                                             [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                                              {line,574}]},
                     {couch_set_view_updater,'-update/8-fun-2-',14,
                                             [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_set_view/src/couch_set_view_updater.erl"},
                                              {line,267}]}]}
Comment by Volker Mische [ 31/Jul/14 ]
Here's the additional log message: http://review.couchbase.org/40107

If no one objects (perhaps someone wants to take a look at the cluster), I'll ask Meenakshi to re-run the test once this is merged.
Comment by Sriram Melkote [ 01/Aug/14 ]
Thanks a lot Volker, sounds good to me. Meenakshi, can you please re-run with additional logging?
Comment by Meenakshi Goel [ 01/Aug/14 ]
Yes will do once this fix for additional logging gets merged.
Comment by Volker Mische [ 01/Aug/14 ]
The commit with the log message got merged. Meenakshi, please rerun the test.




[MB-11864] Adding bucket leads to memcached crash Created: 31/Jul/14  Updated: 01/Aug/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Build 1077

Scenario: In a 3-node cluster, add a default bucket.

Result: Memcached crashes with the following exception

Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 139. Restarting. Messages: Thu Jul 31 18:34:41.633445 PDT 3: (default) Trying to connect to mccouch: "127.0.0.1:11213"
Thu Jul 31 18:34:41.634069 PDT 3: (default) Connected to mccouch: "127.0.0.1:11213"
Thu Jul 31 18:34:41.640616 PDT 3: (No Engine) Bucket default registered with low priority
Thu Jul 31 18:34:41.640681 PDT 3: (No Engine) Spawning 4 readers, 4 writers, 1 auxIO, 1 nonIO threads


 Comments   
Comment by Parag Agarwal [ 31/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11864/core.tar.gz

https://s3.amazonaws.com/bugdb/jira/MB-11864/log_memcached_crash.tar.gz
Comment by Chiyoung Seo [ 31/Jul/14 ]
http://review.couchbase.org/#/c/40156/

Can you test it again when the build is ready?
Comment by Parag Agarwal [ 31/Jul/14 ]
Reproduced with the latest build 1078.
Comment by Chiyoung Seo [ 01/Aug/14 ]
I didn't see any crashes with build 1075, which has ep-engine revision 3aaaaa4f67c847994994ddfdacf42733b1489182, but I saw the crash with build 1081, which has the same ep-engine revision as build 1075.
Comment by Pavel Paulau [ 01/Aug/14 ]
Was that caused by tcmalloc changes again?




[MB-11405] Shared thread pool: high CPU overhead due to OS level context switches / scheduling Created: 11/Jun/14  Updated: 01/Aug/14

Status: In Progress
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-805

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File cpu.png     PNG File max_threads_cpu.png     PNG File memcached_cpu_b988.png     PNG File memcached_cpu_toy.png     Text File perf_b829.log     Text File perf_b854_8threads.log     Text File perf.log    
Issue Links:
Relates to
relates to MB-11434 600-800% CPU consumption by memcached... Closed
relates to MB-11738 Evaluate GIO CPU utilization on syste... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/perf-dev/424/artifact/172.23.100.17.zip
http://ci.sc.couchbase.com/job/perf-dev/424/artifact/172.23.100.18.zip
Is this a Regression?: Yes

 Description   
Originally reported as "~2400% CPU consumption by memcached during ongoing workload with five (5) buckets ".

The CPU usage of the memcached process is more than two times the usage in the previous release. This is due to increased scheduling overhead from the shared thread pool.
Workaround: Reduce the number of threads on systems that have more than 30 cores.

2 nodes, 5 buckets
1M docs (clusterwise), equally distributed, non-DGM
10K mixed ops/sec (85% reads, 1% creates, 1% deletes, 13% updates; clusterwise), equally distributed

CPU utilization in 2.5.1: ~300%
CPU utilization in 3.0.0: ~2400%



 Comments   
Comment by Pavel Paulau [ 12/Jun/14 ]
Interesting chart that shows how CPU utilization depends on #buckets (2-18) and #nodes (2, 4, 8).
Comment by Sundar Sridharan [ 16/Jun/14 ]
More nodes mean fewer vbuckets per node, resulting in fewer writer tasks, which may explain the lower CPU per node.
Here is a partial fix, based on the attached perf.log, that I hope will help: http://review.couchbase.org/38337
More fixes may follow if needed. Thanks.
Comment by Sundar Sridharan [ 16/Jun/14 ]
Hi Pavel, the fix to reduce the getDescription() noise has been merged.
Could you please re-run the workload and see if we still have high CPU usage, and if so, what the new profiler output looks like? Thanks.
Comment by Pavel Paulau [ 18/Jun/14 ]
Still high CPU utilization.
Comment by Sundar Sridharan [ 18/Jun/14 ]
Thanks Pavel, it looks like the getDescription() noise has gone away. However, this performance result is quite interesting: 85% of the overhead is from the kernel, most likely context switching from the higher number of threads. This will require some more creative solutions to reduce CPU usage without incurring a performance penalty.
Comment by Sundar Sridharan [ 20/Jun/14 ]
Another fix to reduce active system CPU usage, by letting only one thread snooze while the others sleep, is located here: http://review.couchbase.org/38620. Thanks.
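For readers unfamiliar with the approach, here is a minimal, hedged sketch in C (not the actual ep-engine executor code) of the "one snoozer" idea: only one idle worker polls with a timed wait, while the remaining idle workers block until explicitly signalled, so an idle pool does not pay one wakeup per thread per tick. All names below are illustrative.

/* Hedged sketch only; build with: cc snooze.c -pthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int pending_tasks = 0;      /* work items waiting to run */
static bool have_snoozer = false;  /* true while one thread holds the timed wait */

void worker_wait_for_task(void) {
    pthread_mutex_lock(&lock);
    while (pending_tasks == 0) {
        if (!have_snoozer) {
            /* The first idle thread polls with a 1-second timeout ("snoozes"). */
            struct timespec deadline;
            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += 1;
            have_snoozer = true;
            pthread_cond_timedwait(&wake, &lock, &deadline);
            have_snoozer = false;
        } else {
            /* Everyone else sleeps until explicitly signalled. */
            pthread_cond_wait(&wake, &lock);
        }
    }
    pending_tasks--;
    pthread_mutex_unlock(&lock);
}

void submit_task(void) {
    pthread_mutex_lock(&lock);
    pending_tasks++;
    pthread_cond_signal(&wake);    /* wake at most one worker */
    pthread_mutex_unlock(&lock);
}

int main(void) {
    submit_task();
    worker_wait_for_task();        /* returns immediately: a task is pending */
    puts("consumed one task");
    return 0;
}

The real change is in the review linked above; this only illustrates why limiting the number of timed waiters cuts the sched_wakeup rate measured later in this ticket.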
Comment by Sundar Sridharan [ 20/Jun/14 ]
Pavel, the fix has been merged. Local testing showed marginal improvement. Could you please retry the test and let me know if it helps in the larger setup? Thanks.
Comment by Pavel Paulau [ 20/Jun/14 ]
Ok, will do. Any expected side effects?
Comment by Pavel Paulau [ 21/Jun/14 ]
I have tried build 3.0.0-854 which includes your change. No impact on performance, still very high CPU utilization.

Please notice that CPU consumption drops to ~400% when I decrease the number of threads from 30 (auto-tuned) to 8.
Comment by Sundar Sridharan [ 23/Jun/14 ]
Reducing the number of threads should not be the solution. The main new thing in 3.0 is that we can have 4 writer threads per bucket, so with 5 buckets we may have 20 writer threads; in 2.5 there would be only 5 writer threads for 5 buckets.
This means we should not expect less than 4 times the CPU use of 2.5, simply because the increased CPU cost is buying us lower disk write latency.
Comment by Pavel Paulau [ 23/Jun/14 ]
Fair enough.

In this case the resolution criterion for this ticket should be 600% CPU utilization by memcached.
Comment by Chiyoung Seo [ 26/Jun/14 ]
Another fix was merged:

http://review.couchbase.org/#/c/38756/
Comment by Pavel Paulau [ 26/Jun/14 ]
Sorry,

The same utilization - build 3.0.0-884.
Comment by Sundar Sridharan [ 30/Jun/14 ]
A debugging fix was merged here: http://review.couchbase.org/#/c/38909/. If possible, could you please leave the cluster with this change on for some time for debugging? Thanks.
Comment by Pavel Paulau [ 30/Jun/14 ]
There might be a delay in getting results due to limited h/w resources and the upcoming beta release.
Comment by Pavel Paulau [ 01/Jul/14 ]
Assigning back to Sundar because he is working on his own test.
Comment by Pavel Paulau [ 02/Jul/14 ]
Promoting to "Blocker", it currently seems to be one of the most severe performance issues in 3.0.
Comment by Sundar Sridharan [ 02/Jul/14 ]
Pavel, could you try setting max_threads=20 and re-running the workload to see if this reduces the CPU overhead and unblocks other performance testing? Thanks.
Comment by Pavel Paulau [ 02/Jul/14 ]
Will do, after beta release.

But please notice that performance testing is not blocked.
Comment by Pavel Paulau [ 04/Jul/14 ]
Some interesting observations...

For the same workload I compared number of scheduler wake ups.
3.0-beta with 4 front-end threads and 30 ep-engine threads (auto-tuned):

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '47284':

         7,940,880 sched:sched_wakeup

      30.000548575 seconds time elapsed

2.5.1 with default settings:

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '3677':

           117,003 sched:sched_wakeup

      30.000550702 seconds time elapsed
 
Not surprisingly more write heavy workload (all ops are updates) reduces CPU utilization (down to 600-800%) and scheduling overhead:

$ perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '22699':

         4,014,534 sched:sched_wakeup

      30.000556091 seconds time elapsed

Obviously global IO works nicely when the IO workload is fairly aggressive and there is always work to do.
And it's absolutely crazy when threads constantly need to be put to sleep and woken up, which is not uncommon.
Comment by Sundar Sridharan [ 07/Jul/14 ]
Thanks Pavel, as discussed, could you please update the ticket with the results from thread throttling on your 48 core setup?
Comment by Pavel Paulau [ 07/Jul/14 ]
btw, it has only 40 cores/vCPU.
Comment by Sundar Sridharan [ 08/Jul/14 ]
Thanks for the graph Pavel - this confirms our theory that with a higher number of threads our scheduling is not able to put threads to sleep in an efficient manner.
Comment by Sundar Sridharan [ 13/Jul/14 ]
Fix for distributed sleep uploaded for review; this is expected to lower the scheduling overhead: http://review.couchbase.org/#/c/39210/ Thanks.
Comment by Sundar Sridharan [ 15/Jul/14 ]
Hi Pavel, could you please let us know if the fix in this toy build shows any cpu improvement?
couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64.rpm
thanks
Comment by Pavel Paulau [ 16/Jul/14 ]
I assume you meant couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-703-toy.rpm

See my comment in MB-11434.
Comment by Pavel Paulau [ 16/Jul/14 ]
Logs: http://ci.sc.couchbase.com/view/lab/job/perf-dev/498/artifact/
Comment by Sundar Sridharan [ 18/Jul/14 ]
Dynamically configurable thread limits fix uploaded for review: http://review.couchbase.org/#/c/39475/
It is expected to mitigate heavy CPU usage and allow tunable testing.
Comment by Chiyoung Seo [ 18/Jul/14 ]
The change was merged.

Pavel, please test it again when you have time.
Comment by Pavel Paulau [ 20/Jul/14 ]
This is how it looks now.

Logs:
http://ci.sc.couchbase.com/view/lab/job/perf-dev/501/artifact/
Comment by Sundar Sridharan [ 21/Jul/14 ]
From the performance logs uploaded, it looks like with the recent changes memcached's CPU usage dropped from
  85.20% memcached [kernel.kallsyms] [k] _spin_lock
...down to...
  16.01% memcached [kernel.kallsyms] [k] _spin_lock

That is a 5x improvement, which means we are looking at about 500% usage - just a marginal increase over the 300% CPU usage in 2.5, but with better consolidation.
Could you please close this bug if you find this satisfactory? Thanks
Comment by Sundar Sridharan [ 21/Jul/14 ]
Pavel, another fix to address a CPU hotspot issue in the persistence path has been uploaded for review. Sorry to ask again, but could you please retest with this fix http://review.couchbase.org/#/c/39645
Comment by Pavel Paulau [ 29/Jul/14 ]
I just tried build 3.0.0-1045 and test case with 5 buckets.

CPU utilization is still very high (~2400%) and resources are mostly spent in kernel space:

# sar -u 4
Linux 2.6.32-431.17.1.el6.x86_64 (atlas-s310) 07/28/2014 _x86_64_ (40 CPU)

10:57:44 PM CPU %user %nice %system %iowait %steal %idle
10:57:48 PM all 6.36 0.00 49.99 1.09 0.00 42.56
10:57:52 PM all 6.09 0.00 50.86 0.90 0.00 42.14
10:57:56 PM all 6.28 0.00 46.59 1.13 0.00 46.00
10:58:00 PM all 6.15 0.00 48.49 0.93 0.00 44.43
10:58:04 PM all 6.01 0.00 48.77 1.14 0.00 44.08
10:58:08 PM all 6.22 0.00 48.21 1.14 0.00 44.44

Rate of wakeups is high as well:

# perf stat -e sched:sched_wakeup -p `pgrep memcached` -a sleep 30

 Performance counter stats for process id '29970':

         8,888,980 sched:sched_wakeup

      30.013133143 seconds time elapsed

From perf profiler:

    82.33% memcached [kernel.kallsyms] [k] _spin_lock

https://s3.amazonaws.com/bugdb/jira/MB-11405/perf_b1045.log
Comment by Sundar Sridharan [ 31/Jul/14 ]
fix: http://review.couchbase.org/#/c/40080/ and
fix: http://review.couchbase.org/#/c/40084/
are expected to reduce CPU context-switching overhead and also bring bgfetch latencies back to 2.5.1 levels.
Pavel, could you please verify this on your setup?
thanks
Comment by Chiyoung Seo [ 31/Jul/14 ]
Pavel,

The above two changes were just merged. I hope these finally resolve the issue :)
Comment by Pavel Paulau [ 01/Aug/14 ]
It finally helps: CPU utilization drops from ~2400% to ~450%. From the profiler:

    25.75% memcached [kernel.kallsyms] [k] _spin_lock

Do you plan any other improvements?




[MB-11846] Compiling breakdancer test case exceeds available memory Created: 29/Jul/14  Updated: 01/Aug/14  Due: 30/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Chris Hillery Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
1. With memcached change 4bb252a2a7d9a369c80f8db71b3b5dc1c9f47eb9, cc1 on ubuntu-1204 quickly uses up 100% of the available memory (4GB RAM, 512MB swap) and crashes with an internal error.

2. Without Trond's change, cc1 compiles fine and never takes up more than 12% memory, running on the same hardware.

 Comments   
Comment by Chris Hillery [ 29/Jul/14 ]
Ok, weird fact - on further investigation, it appears that this is NOT happening on the production build server, which is an identically-configured VM. It only appears to be happening on the commit validation server ci03. I'm going to temporarily disable that machine so the next make-simple-github-tap test runs on a different ci server and see if it is unique to ci03. If it is I will lower the priority of the bug. I'd still appreciate some help in understanding what's going on either way.
Comment by Trond Norbye [ 30/Jul/14 ]
Please verify that the two builders have the same patch level so that we're comparing apples with apples.

It does bring up another interesting topic: should our builders just use the compiler provided with the installation, or should we have a reference compiler that we use to build our code? It does seem like a bad idea to have to support a ton of different compiler revisions (including the fact that they support different levels of C++11 that we have to work around).
Comment by Chris Hillery [ 31/Jul/14 ]
This is now occurring on other CI build servers in other tests - http://www.couchbase.com/issues/browse/CBD-1423

I am bumping this back to Test Blocker and I will revert the change as a work-around for now.
Comment by Chris Hillery [ 31/Jul/14 ]
Partial revert committed to memcached master: http://review.couchbase.org/#/c/40152/ and 3.0: http://review.couchbase.org/#/c/40153/
Comment by Trond Norbye [ 01/Aug/14 ]
That review in memcached should NEVER have been pushed through. Its subject line is too long.
Comment by Chris Hillery [ 01/Aug/14 ]
If there's a documented standard out there for commit messages, my apologies; it was never revealed to me.
Comment by Trond Norbye [ 01/Aug/14 ]
When it doesn't fit within a terminal window there is a problem; it is way better to use multiple lines.

In addition, I'm not happy with the fix. Instead of deleting the line, it should have checked for an environment variable so that people could explicitly disable it. This is why we have review cycles.
Comment by Chris Hillery [ 01/Aug/14 ]
I don't think I want to get into style arguments. If there's a standard I'll use it. In the meantime I'll try to keep things to 72-character lines.

As to the content of the change, it was not intended to be a "fix"; it was a simple revert of a change that was provably breaking other jobs. I returned the code to its previous state, nothing more or less. And especially given the time crunch of the beta (which is supposed to be built tomorrow), waiting for a code review on a reversion is not in the cards.
Comment by Trond Norbye [ 01/Aug/14 ]
The normal way of doing a revert is to use git revert (which as an extra bonus makes the commit message contain that).
Comment by Trond Norbye [ 01/Aug/14 ]
http://review.couchbase.org/#/c/40165/
Comment by Chris Hillery [ 01/Aug/14 ]
1. Your fix is not correct, because simply adding -D to cmake won't cause any preprocessor defines to be created. You need to have some CONFIGURE_FILE() or similar to create a config.h using #cmakedefine. As it is there is no way to compile with your change.

2. The default behaviour should not be the one that is known to cause problems. Until and unless there is an actual fix for the problem (whether or not that is in the code), the default should be to keep the optimization, with an option to let individuals bypass that if they desire and accept the risks.

3. Characterizing the problem as "misconfigured VMs" is, at best, premature.

I will revert this change again on the 3.0 branch shortly, unless you have a better suggestion (I'm definitely all ears for a better suggestion!).
Comment by Trond Norbye [ 01/Aug/14 ]
If you look at the comment, it passes the -D over into CMAKE_C_FLAGS, causing it to be set in the compiler flags so that it is passed on to the compilation cycle.

As for the misconfiguration, it is either insufficient resources on the VM or a "broken" compiler version installed there.
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login credentials to the server where it fails and to an identical VM where it succeeds?
Comment by Chris Hillery [ 01/Aug/14 ]
[CMAKE_C_FLAGS] Fair enough, I did misread that. That's not really a sufficient workaround, though. Doing that may overwrite other CFLAGS set by other parts of the build process.

I still maintain that the default behaviour should be the known-working version. However, for the moment I have temporarily locked the rel-3.0.0.xml manifest to the revision before my revert (ie, to 5cc2f8d928f0eef8bddbcb2fcb796bc5e9768bb8), so I won't revert anything else until that has been tested.

The only VM I know of at the moment where we haven't seen build failures is the production build slave. I can't give you access to that tonight as we're in crunch mode to produce a beta build. Let's plan to hook up next week and do some exploration.
Comment by Volker Mische [ 01/Aug/14 ]
There are commit message guidelines. At the bottom of

http://www.couchbase.com/wiki/display/couchbase/Contributing+Changes

links to:

http://en.wikibooks.org/wiki/Git/Introduction#Good_commit_messages
Comment by Trond Norbye [ 01/Aug/14 ]
I've not done anything on the 3.0.0 branch; the fix going forward is for 3.0.1 and trunk. Hopefully the 3.0 branch will die relatively soon since we've got a lot of good stuff in the 3.0.1 branch.

The "workaround" is not intended as a permanent solution, its just until the vms is fixed. I've not been able to reproduce this issue on my centos, ubuntu, fedora or smartos builders. They're running in the following vm's:

[root@00-26-b9-85-bd-92 ~]# vmadm list
UUID TYPE RAM STATE ALIAS
04bf8284-9c23-4870-9510-0224e7478f08 KVM 2048 running centos-6
7bcd48a8-dcc2-43a6-a1d8-99fbf89679d9 KVM 2048 running ubuntu
c99931d7-eaa3-47b4-b7f0-cb5c4b3f5400 KVM 2048 running fedora
921a3571-e1f6-49f3-accb-354b4fa125ea OS 4096 running compilesrv
Comment by Trond Norbye [ 01/Aug/14 ]
I need access to two identical configured builders where one may reproduce the error and one where it succeeds.
Comment by Volker Mische [ 01/Aug/14 ]
I would also add that I think it is about bad VMs. On commit validation we have 6 VMs; it only ever failed on ubuntu-1204-64-ci-01 due to this error and never on the others (ubuntu-1204-64-ci-02 - 06).
Comment by Chris Hillery [ 01/Aug/14 ]
That's not correct. The problem originally occurred on ci-03.
Comment by Volker Mische [ 01/Aug/14 ]
Then I need to correct myself: my comment only holds true for the couchdb-gerrit-300 job.
Comment by Trond Norbye [ 01/Aug/14 ]
Can I get login creds to one that it fails on, while I'm waiting for access to one that it works on?
Comment by Volker Mische [ 01/Aug/14 ]
I don't know about creds (I think my normal user login works). The machine details are here: http://factory.couchbase.com/computer/ubuntu-1204-64-ci-01/
Comment by Chris Hillery [ 01/Aug/14 ]
Volker - it was initially detected in the make-simple-github-tap job, so it's not unique to couchdb-gerrit-300 either. Both jobs pretty much just checkout the code and build it, though; they're pretty similar.
Comment by Trond Norbye [ 01/Aug/14 ]
Adding swap space to the builder makes the compilation pass. I've been trying to figure out how to get gcc to print more information about each step (the -ftime-report memory usage didn't at all match the process usage ;-))




[MB-11548] Memcached does not handle going back in time. Created: 25/Jun/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Patrick Varley Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: customer, memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: No

 Description   
When you change the time of the server to a time in the past while the memcached process is running, it will start expiring all documents with a TTL.

To recreate, set the date to a time in the past, for example 2 hours ago:

sudo date --set="15:56:56"

You will see that the time and uptime reported by cbstats change to very large values:

time: 5698679116
uptime: 4294946592

Looking at the code we can see how this happens:
http://src.couchbase.org/source/xref/2.5.1/memcached/daemon/memcached.c#6462

When you change the time to a value in the past, "process_started" will be greater than "timer.tv_sec", and current_time is unsigned, which means it will wrap around.

What I do not understand from the code is why current_time is the number of seconds since memcached started and not just the epoch time. (There is a comment about avoiding 64-bit.)

http://src.couchbase.org/source/xref/2.5.1/memcached/daemon/memcached.c#117

In any case, we should check whether "process_started" is bigger than "timer.tv_sec" and do something smart.

I will let you decide what the smart thing is :)
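As an aside, the wrap-around can be shown with a few lines of standalone C (the timestamps below are hypothetical, and rel_time_t just mirrors the unsigned relative clock described above):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t rel_time_t;   /* unsigned relative clock, as in memcached.c */

int main(void) {
    /* Hypothetical values: the wall clock is set back two hours after the
     * process started, as in the reproduction steps above. */
    time_t process_started = 1700000000;
    time_t timer_tv_sec    = process_started - 2 * 60 * 60;

    /* process_started > timer.tv_sec and the difference is stored in an
     * unsigned type, so it wraps to a value close to 2^32 -- the same kind
     * of huge "time"/"uptime" that cbstats reports above. */
    rel_time_t current_time = (rel_time_t)(timer_tv_sec - process_started);
    printf("current_time = %u\n", (unsigned)current_time);
    return 0;
}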

 Comments   
Comment by Patrick Varley [ 07/Jul/14 ]
It would be good if we could get this fixed in 3.0. Maybe a quick patch like this is good enough for now:

static void set_current_time(void) {
    struct timeval timer;

    gettimeofday(&timer, NULL);
    if (process_started < timer.tv_sec) {
        current_time = (rel_time_t) (timer.tv_sec - process_started);
    } else {
        /* Clock went backwards past process_started; shut down rather than
         * letting current_time wrap around. */
        settings.extensions.logger->log(EXTENSION_LOG_WARNING, NULL,
            "Time has gone backward, shutting down to protect data.\n");
        shutdown_server();
    }
}


More than happy to submit the code for review.
Comment by Chiyoung Seo [ 07/Jul/14 ]
Trond,

Can you see if we can address this issue in 3.0?
Comment by Jim Walker [ 08/Jul/14 ]
Looks to me like clock_handler (which wakes up every second) should be looking for time going backwards. It is sampling time every second, so it can easily see big shifts in the clock and make appropriate adjustments.

I don't think we should be shutting down if we can deal with it, but it does open interesting questions about TTLs and gettimeofday going backwards.

Perhaps we need to adjust process_started by the shift?

Happy to pick this up, just doing some other stuff at the moment...
Comment by Patrick Varley [ 08/Jul/14 ]
clock_handler calls set_current_time, which is where all the damage is done.

I agree that if we can handle it better we should not shut down. I did think about changing process_started, but that seemed a bit like a hack in my head, though I cannot explain why :).
I was also wondering: what should we do when time shifts forward?

I think this has some interesting effects on the stats too.
Comment by Patrick Varley [ 08/Jul/14 ]
Silly question, but why not set current_time to epoch seconds instead of computing the offset from process_started?
Comment by Jim Walker [ 09/Jul/14 ]
@patrick, this is shared code used by memcache and couchbase buckets. Note that memcache buckets store expiry as seconds since the process started, while couch buckets store expiry as seconds since the epoch, hence all the number shuffling.
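To make the "number shuffling" concrete, here is a rough sketch of the kind of conversion involved (modelled on the classic memcached realtime() logic, not a verbatim copy of our source; the cut-off constant and variables are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t rel_time_t;

/* Client-supplied expiry values above ~30 days are treated as absolute epoch
 * timestamps; smaller values are relative offsets. Both end up expressed on
 * the process-relative clock. */
#define REALTIME_MAXDELTA (60 * 60 * 24 * 30)

static time_t     process_started = 1700000000; /* wall-clock time at startup (illustrative) */
static rel_time_t current_time    = 300;        /* seconds since start, advanced each tick */

static rel_time_t realtime_sketch(time_t exptime) {
    if (exptime == 0) {
        return 0;                         /* never expires */
    }
    if (exptime > REALTIME_MAXDELTA) {
        /* Absolute epoch time: rebase onto the process-relative clock.
         * If the wall clock moved backwards, this is where expiry goes wrong. */
        if (exptime <= process_started) {
            return (rel_time_t)1;         /* already in the past */
        }
        return (rel_time_t)(exptime - process_started);
    }
    /* Relative offset: just add it to the current relative time. */
    return (rel_time_t)(exptime + current_time);
}

int main(void) {
    printf("rel expiry for +60s: %u\n", (unsigned)realtime_sketch(60));
    return 0;
}

It also shows why a wall-clock jump relative to process_started makes absolute expiries land in the past (see the 11:00am/10:05am example in the next comment).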
Comment by Jim Walker [ 09/Jul/14 ]
get_current_time() is used for a number of time-based lock checks (see getl) and for document expiry itself (both within memcached and couchbase buckets).

process_started is an absolute timestamp and can lead to incorrect expiry if the real clock jumps. Example:
 - 11:00am memcached starts, process_started = 11:00am (ignoring the -2 second thing)
 - 11:05am ntp comes in and aligns the node to the correct data-centre time (let's say -1hr); the time is now 10:05am
 - 10:10am clients now set documents with an absolute expiry of 10:45am
 - the documents instantly expire because memcached thinks they're in the past... the client scratches head.

Ultimately we need to ensure that the functions get_current_time(), realtime() and abstime() all do sensible things if the clock is changed, e.g. don't return large unsigned values.
 
Given all this I think the requirements are:

R1 Define a memcached time tick interval (which is 1 second)
  - set_current_time() callback executes at this frequency.

R2 get_current_time() the value returned must be shielded from clock changes.
   - If clock goes backwards, the returned value still increases by R1.
   - If clock goes forwards, the returned value still increases by R1.
   - Really this returns process uptime in seconds and the stat “uptime” is just current_time.

R3 monitor the system time for jumps (forward or backward).
   - Reset process_started to be current time if there’s a change which is greater or less than R1 ticks.

R4 Ensure documentation describes the effect of system clock changes and the two ways you can set document expiry.
  

Overall the code changes needed to address the issue are simple; I will also look at making testrunner tests to ensure the system behaves. A sketch of the idea follows.
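A hedged sketch of one way R2 could be satisfied (assuming a POSIX monotonic clock is available; this is an illustration, not the actual memcached patch linked later in this ticket): derive the tick count from CLOCK_MONOTONIC so that wall-clock jumps in either direction never show up in the returned uptime. R3 would still be handled separately by comparing the wall clock against the expected tick.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t rel_time_t;

/* Captured once at startup. CLOCK_MONOTONIC is not affected by
 * settimeofday()/ntp adjustments, so uptime only ever moves forward (R1/R2). */
static struct timespec monotonic_start;

void time_init(void) {
    clock_gettime(CLOCK_MONOTONIC, &monotonic_start);
}

/* R2: process uptime in seconds, shielded from wall-clock changes. */
rel_time_t get_current_time_sketch(void) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (rel_time_t)(now.tv_sec - monotonic_start.tv_sec);
}

int main(void) {
    time_init();
    printf("uptime: %u seconds\n", (unsigned)get_current_time_sketch());
    return 0;
}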
Comment by Patrick Varley [ 09/Jul/14 ]
Sounds good, a small reminder about handling VMs that are suspended.
Comment by Jim Walker [ 24/Jul/14 ]
Patch for platform http://review.couchbase.org/39811
Patch for memcached http://review.couchbase.org/39813
Comment by Jim Walker [ 01/Aug/14 ]
Minor adjustment and improved warning

http://review.couchbase.org/40172




[MB-11799] Bucket compaction causes massive slowness of flusher and UPR consumers Created: 23/Jul/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_b1-vs-compaction_b2-vs-ep_upr_replica_items_remaining-vs_xdcr_lag.png    
Issue Links:
Duplicate
is duplicated by MB-11731 Persistence to disk suffers from buck... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/386/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Similar to MB-11731 which is getting worse and worse. But now compaction affects intra-cluster replication and XDCR latency as well:

"ep_upr_replica_items_remaining" reaches 1M during compaction
"xdcr latency" reaches 5 minutes during compaction.

See attached charts for details. Full reports:

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1005_a66_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1005_6d2_access

One important change that we made recently - http://review.couchbase.org/#/c/39647/.

The last known working build is 3.0.0-988.

 Comments   
Comment by Pavel Paulau [ 23/Jul/14 ]
Chiyoung,

This is a really critical regression. It affects many XDCR tests and also blocks many investigation/tuning efforts.
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix added for review at http://review.couchbase.org/39880. Thanks.
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue:

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.
Comment by Pavel Paulau [ 26/Jul/14 ]
Toy build helps a lot.

It doesn't fix the problem but at least minimizes the regression:
-- ep_upr_replica_items_remaining is close to zero now
-- the write queue is 10x lower
-- max xdcr latency is about 8-9 seconds

Logs: http://ci.sc.couchbase.com/view/lab/job/perf-dev/530/
Reports:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-785-toy_6ed_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-785-toy_269_access
Comment by Chiyoung Seo [ 26/Jul/14 ]
Thanks Pavel for the updates. We will merge the above changes soon.

Do you mean that both the disk write queue size and the XDCR latency are still regressions, or is XDCR your only major concern?

As you pointed out above, the recent change to parallelize compaction (4 workers by default) is most likely the main root cause of this issue. Do you still see the compaction slowness in your tests? I guess "no", because we can now run 4 concurrent compaction tasks on each node.

I will talk to Aliaksey to understand that change more.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Pavel,

I will continue to look at some more optimizations on the ep-engine side. In the meantime, you may want to test the toy build again by lowering compaction_number_of_kv_workers on the ns-server side from 4 to 1. As mentioned in http://review.couchbase.org/#/c/39647/ , that parameter is configurable on the ns-server side.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Btw, all the changes above were merged. You can use the new build and lower the above compaction parameter.
Comment by Pavel Paulau [ 28/Jul/14 ]
Build 3.0.0-1035 with compaction_number_of_kv_workers = 1:

http://ci.sc.couchbase.com/job/perf-dev/533/artifact/

Source: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1035_276_access
Destination: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1035_624_access

Disk write queue is lower (max ~5-10K) but xdcr latency is still high (several seconds) and affected by compaction.
Comment by Chiyoung Seo [ 30/Jul/14 ]
Pavel,

The following change is merged:

http://review.couchbase.org/#/c/40043/

I plan to make another change for this issue today, but you may want to test it with the new build that includes the above fix
Comment by Chiyoung Seo [ 30/Jul/14 ]
I just pushed another important change in gerrit for review:

http://review.couchbase.org/#/c/40059/
Comment by Chiyoung Seo [ 30/Jul/14 ]
Pavel,

The above two changes were merged. Please retest it to see if they resolve this issue.
Comment by Pavel Paulau [ 31/Jul/14 ]
It does not.

Comparison with previously tested build 3.0.0-1045:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1045_f29_access&snapshot=atlas_c1_300-1061_8b3_access

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1045_cf4_access&snapshot=atlas_c2_300-1061_3c8_access

Pretty much the same characteristics. Logs:
http://ci.sc.couchbase.com/job/xdcr-5x5/409/artifact/
Comment by Chiyoung Seo [ 31/Jul/14 ]
Thanks Pavel for the updates.

I debugged this issue more and found that a lot of UPR backfill tasks were scheduled unnecessarily even when items can be read from checkpoints in memory. Mike pushed a fix to address this issue:

http://review.couchbase.org/#/c/40145/
Comment by Pavel Paulau [ 01/Aug/14 ]
Can't verify.

Build 3.0.0-1076 with that change is missing here: http://latestbuilds.hq.couchbase.com/
Builds 3.0.0-1077+ are still broken.




[MB-11582] tmp_oom errors due to spike in memory usage during delta recovery Created: 27/Jun/14  Updated: 01/Aug/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-888

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File mem_used.png    
Issue Links:
Dependency
depends on MB-11734 Memory growing to over high water mar... Resolved
Relates to
relates to MB-10771 Delta recovery is slower than full re... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/234/artifact/
Is this a Regression?: No

 Description   
See MB-10771 for details.

 Comments   
Comment by Abhinav Dangeti [ 16/Jul/14 ]
Hey Pavel, do you have any updates on this, since Chiyoung merged all the checkpoint-related changes, plus the fix for the memory leak with deletes, over the last 2 weeks?
Comment by Pavel Paulau [ 16/Jul/14 ]
Logs from build 3.0.0-966:

http://ci.sc.couchbase.com/job/ares/340/artifact/

It still happens.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Pavel Paulau [ 21/Jul/14 ]
According to @Abhinav.
Comment by Chiyoung Seo [ 23/Jul/14 ]
Pavel,

We recently made two more important changes that mitigated the memory issues:

http://review.couchbase.org/#/c/39556/
http://review.couchbase.org/#/c/38779/

Can you please test it again with the latest build that includes the above changes?
Comment by Pavel Paulau [ 23/Jul/14 ]
Saw tmp_oom errors in build 3.0.0-1005.

MB-11796 has logs.
Comment by Pavel Paulau [ 24/Jul/14 ]
Just in case, logs from build 3.0.0-1016: http://ci.sc.couchbase.com/job/ares/404/artifact/ .
Comment by Abhinav Dangeti [ 31/Jul/14 ]
Hey Pavel, I'm still looking into this issue.
In the meantime, I made a small improvement for handling tmpOOMs. With this change: http://review.couchbase.org/#/c/40037/, you should be seeing fewer tmpOOMs than before.
Comment by Abhinav Dangeti [ 31/Jul/14 ]
This change: http://review.couchbase.org/#/c/40126, made the tmpOOMs go away in my tests.
Comment by Chiyoung Seo [ 31/Jul/14 ]
The above two fixes resolved the issue in our tests. Please verify it when the new build is ready.
Comment by Pavel Paulau [ 01/Aug/14 ]
Can't verify.

Build 3.0.0-1076 with that change is missing here: http://latestbuilds.hq.couchbase.com/
Builds 3.0.0-1077+ are still broken.




[MB-11865] Docs: Correctly specify the port to telnet to test the server Created: 01/Aug/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
In the install guide [1] we tell people how to test their system using telnet. However, the telnet command omits the memcached (moxi) port number to connect to, and so these instructions don't actually work.

The correct command should be:

    telnet localhost 11211


We should also probably highlight that this is connecting to the legacy memcached protocol (using moxi).

[1]: http://docs.couchbase.com/couchbase-manual-2.5/cb-install/index.html#testing-with-telnet

 Comments   
Comment by Dave Rigby [ 01/Aug/14 ]
See this stack overflow question for confusion along these lines: http://stackoverflow.com/questions/25073498/couchbase-test-running-failed




[MB-11585] [windows] A query with stale=false never returns Created: 27/Jun/14  Updated: 01/Aug/14

Status: In Progress
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Tom Yeh Assignee: Ketaki Gangal
Resolution: Unresolved Votes: 0
Labels: viewquery
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Not sure what really happens, but my couchbase server never returns if I issue a query with stale=false. For example,

http://localhost:8092/default/_design/task/_view/by_project?stale=false&key=%22GgmBVrB9CGakdeHNnBMXZyms%22

Also, CPU usage is more than 95%.

It returns immediately if I don't specify stale=false.

It worked fine before, but I'm not sure what happened. Is the database corrupted? Anything I can do?

It is a development environment, so the data is small -- only about 120 docs (and the query should return only about 10 docs).

NOTE: the output of cbcollect_info is uploaded to https://s3.amazonaws.com/customers.couchbase.com/zk


 Comments   
Comment by Sriram Melkote [ 07/Jul/14 ]
Nimish, can you please look at the cbcollect and see if you can analyze the reason the query did not return?
Comment by Nimish Gupta [ 09/Jul/14 ]
From the logs, it looks like a 200 OK response header was sent back to the client:

[couchdb:info,2014-06-27T21:43:49.754,ns_1@127.0.0.1:<0.6397.0>:couch_log:info:39]127.0.0.1 - - GET /default/_design/task/_view/by_project?stale=false&key=%22GgmBVrB9CGakdeHNnBMXZyms%22 200

From the logs, we can't figure out whether couchbase sent the response body. On Windows, we have an issue of indexing getting stuck (https://www.couchbase.com/issues/browse/MB-11385), but from the logs I am not sure whether that bug is the root cause of this issue.

Comment by Sriram Melkote [ 15/Jul/14 ]
Waiting for 3.0 system testing to see if we can reproduce this locally
Comment by Sriram Melkote [ 22/Jul/14 ]
Ketaki, can we please look out for this issue in the 3.0 Windows system tests? Specifically, is there a test that will fail if a single query among many does not respond?




[MB-11670] Rebuild whole project when header file changes Created: 08/Jul/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Volker Mische Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When you change a header file in the view-engine (the couchdb project), the whole project should be rebuilt.

Currently, if you change a header file and you don't clean up the project, you could end up with run-time errors like a badmatch on the #writer_acc record.

PS: I opened this as an MB bug and not as a CBD because this is valuable information about badmatch errors that should be public.

 Comments   
Comment by Chris Hillery [ 09/Jul/14 ]
This really has nothing to do with the build team, and as such it's perfectly appropriate for it to be an MB.

I'm assigning it back to Volker for some more information. Can you give me a specific set of actions that demonstrates the rebuild not happening? Is it to do with Erlang code, or C++?
Comment by Volker Mische [ 09/Jul/14 ]
Build Couchbase with a make.

Now edit a couchdb Erlang header file. For example edit couchdb/src/couch_set_view/include/couch_set_view.hrl and comment this block out (with leading `%`):

-record(set_view_params, {
    max_partitions = 0 :: non_neg_integer(),
    active_partitions = [] :: [partition_id()],
    passive_partitions = [] :: [partition_id()],
    use_replica_index = false :: boolean()
}).

When you do a "make" again, ns_server will complain about something missing, but couchdb won't as it doesn't rebuild at all.

Chris, I hope this information is good enough, if you need more, let me know.
Comment by Anil Kumar [ 30/Jul/14 ]
Triage : Anil, Wayne .. July 30th

Ceej/Volker - Please update the ticket.
Comment by Chris Hillery [ 30/Jul/14 ]
No update, working on beta issues.
Comment by Sriram Melkote [ 01/Aug/14 ]
Moving to 3.0.1 as I think it's probably too late to add this dependency detection to the 3.0 build system.




[MB-11856] Change default max_doc_size and other parameters to sensible default limit values Created: 30/Jul/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Sarath Lakshman Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown




[MB-10921] Possibly file descriptor leak? Created: 22/Apr/14  Updated: 01/Aug/14  Resolved: 01/Aug/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Trond Norbye Assignee: Nimish Gupta
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File df_output.rtf     File du_output.rtf     File ls_delete.rtf     File lsof_10.6.2.164.rtf     File lsof_beam.rtf     File ls_output.rtf    
Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I ran df and du on that server and noticed similar figures (full console log at the end of this email). du reports 68GB on /var/opt/couchbase, whereas df reports ~140GB of disk usage. The lsof command shows that there are several files which have been deleted but are still opened by beam.smp. Those files are in /var/opt/couchbase/.delete/, and their total size amounts to the “Other Data” (roughly 70GB).
 
I’ve never noticed that before, yet recently we started playing with CB views. I wonder if that can be related. Also note that at the time I did the investigation, there had been no activity on the cluster for several hours: no get/set on the buckets, and no compaction or indexing was ongoing.
Are you aware of this problem? What can we do about it?

beam.smp 55872 couchbase 19u REG 8,17 16136013537 4849671 /var/opt/couchbase/.delete/babe701b000ce862e58ca2edd1b8098b (deleted)
beam.smp 55872 couchbase 34u REG 8,17 1029765807 4849668 /var/opt/couchbase/.delete/5c6df85a423263523471f6e20d82ce07 (deleted)
beam.smp 55872 couchbase 51u REG 8,17 1063802330 4849728 /var/opt/couchbase/.delete/c2a11ea6f3e70f8d222ceae9ed482b13 (deleted)
beam.smp 55872 couchbase 55u REG 8,17 403075242 4849667 /var/opt/couchbase/.delete/6af0b53325bf4f2cd1df34b476ee4bb6 (deleted)
beam.smp 55872 couchbase 56r REG 8,17 403075242 4849667 /var/opt/couchbase/.delete/6af0b53325bf4f2cd1df34b476ee4bb6 (deleted)
beam.smp 55872 couchbase 57u REG 8,17 861075170 4849666 /var/opt/couchbase/.delete/72a08b8a613198cd3a340ae15690b7f1 (deleted)
beam.smp 55872 couchbase 58r REG 8,17 861075170 4849666 /var/opt/couchbase/.delete/72a08b8a613198cd3a340ae15690b7f1 (deleted)
beam.smp 55872 couchbase 59r REG 8,17 1029765807 4849668 /var/opt/couchbase/.delete/5c6df85a423263523471f6e20d82ce07 (deleted)
beam.smp 55872 couchbase 60r REG 8,17 896931996 4849672 /var/opt/couchbase/.delete/3b1b7aae4af60e9e720ad0f0d3c0182c (deleted)
beam.smp 55872 couchbase 63r REG 8,17 976476432 4849766 /var/opt/couchbase/.delete/6f5736b1ed9ba232084ee7f0aa5bd011 (deleted)
beam.smp 55872 couchbase 66u REG 8,17 18656904860 4849675 /var/opt/couchbase/.delete/fcaf4193727374b471c990a017a20800 (deleted)
beam.smp 55872 couchbase 67u REG 8,17 662227221 4849726 /var/opt/couchbase/.delete/4e7bbc192f20def5d99447b431591076 (deleted)
beam.smp 55872 couchbase 70u REG 8,17 896931996 4849672 /var/opt/couchbase/.delete/3b1b7aae4af60e9e720ad0f0d3c0182c (deleted)
beam.smp 55872 couchbase 74r REG 8,17 662227221 4849726 /var/opt/couchbase/.delete/4e7bbc192f20def5d99447b431591076 (deleted)
beam.smp 55872 couchbase 75u REG 8,17 1896522981 4849670 /var/opt/couchbase/.delete/3ce0c5999854691fe8e3dacc39fa20dd (deleted)
beam.smp 55872 couchbase 81u REG 8,17 976476432 4849766 /var/opt/couchbase/.delete/6f5736b1ed9ba232084ee7f0aa5bd011 (deleted)
beam.smp 55872 couchbase 82r REG 8,17 1063802330 4849728 /var/opt/couchbase/.delete/c2a11ea6f3e70f8d222ceae9ed482b13 (deleted)
beam.smp 55872 couchbase 83u REG 8,17 1263063280 4849673 /var/opt/couchbase/.delete/e06facd62f73b20505d2fdeab5f66faa (deleted)
beam.smp 55872 couchbase 85u REG 8,17 1000218613 4849767 /var/opt/couchbase/.delete/0c4fb6d5cd7d65a4bae915a4626ccc2b (deleted)
beam.smp 55872 couchbase 87r REG 8,17 1000218613 4849767 /var/opt/couchbase/.delete/0c4fb6d5cd7d65a4bae915a4626ccc2b (deleted)
beam.smp 55872 couchbase 90u REG 8,17 830450260 4849841 /var/opt/couchbase/.delete/7ac46b314e4e30f81cdf0cd664bb174a (deleted)
beam.smp 55872 couchbase 95r REG 8,17 1263063280 4849673 /var/opt/couchbase/.delete/e06facd62f73b20505d2fdeab5f66faa (deleted)
beam.smp 55872 couchbase 96r REG 8,17 1896522981 4849670 /var/opt/couchbase/.delete/3ce0c5999854691fe8e3dacc39fa20dd (deleted)
beam.smp 55872 couchbase 97u REG 8,17 1400132620 4849719 /var/opt/couchbase/.delete/e8eaade7b2ee5ba7a3115f712eba623e (deleted)
beam.smp 55872 couchbase 103r REG 8,17 16136013537 4849671 /var/opt/couchbase/.delete/babe701b000ce862e58ca2edd1b8098b (deleted)
beam.smp 55872 couchbase 104u REG 8,17 1254021993 4849695 /var/opt/couchbase/.delete/f77992cdae28194411b825fa52c560cd (deleted)
beam.smp 55872 couchbase 105r REG 8,17 1254021993 4849695 /var/opt/couchbase/.delete/f77992cdae28194411b825fa52c560cd (deleted)
beam.smp 55872 couchbase 106r REG 8,17 1400132620 4849719 /var/opt/couchbase/.delete/e8eaade7b2ee5ba7a3115f712eba623e (deleted)
beam.smp 55872 couchbase 108u REG 8,17 1371453421 4849793 /var/opt/couchbase/.delete/9b8b199920075102e52742c49233c57c (deleted)
beam.smp 55872 couchbase 109r REG 8,17 1371453421 4849793 /var/opt/couchbase/.delete/9b8b199920075102e52742c49233c57c (deleted)
beam.smp 55872 couchbase 111r REG 8,17 18656904860 4849675 /var/opt/couchbase/.delete/fcaf4193727374b471c990a017a20800 (deleted)
beam.smp 55872 couchbase 115u REG 8,17 16442158432 4849708 /var/opt/couchbase/.delete/2b70b084bd9d0a1790de9b3ee6c78f69 (deleted)
beam.smp 55872 couchbase 116r REG 8,17 16442158432 4849708 /var/opt/couchbase/.delete/2b70b084bd9d0a1790de9b3ee6c78f69 (deleted)
beam.smp 55872 couchbase 151r REG 8,17 830450260 4849841 /var/opt/couchbase/.delete/7ac46b314e4e30f81cdf0cd664bb174a (deleted)
beam.smp 55872 couchbase 181u REG 8,17 770014022 4849751 /var/opt/couchbase/.delete/d35ac74521ae4c1d455c60240e1c41e1 (deleted)
beam.smp 55872 couchbase 182r REG 8,17 770014022 4849751 /var/opt/couchbase/.delete/d35ac74521ae4c1d455c60240e1c41e1 (deleted)
beam.smp 55872 couchbase 184u REG 8,17 775017865 4849786 /var/opt/couchbase/.delete/2a85b841a373ee149290b0ec906aae55 (deleted)
beam.smp 55872 couchbase 185r REG 8,17 775017865 4849786 /var/opt/couchbase/.delete/2a85b841a373ee149290b0ec906aae55 (deleted)
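For readers triaging this, a tiny standalone C demonstration (hypothetical path, not Couchbase code) of why du and df disagree here: an unlinked file keeps consuming disk space for as long as any process holds a descriptor to it, which is exactly what the "(deleted)" entries above show.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/deleted-but-open-demo";  /* hypothetical demo path */
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Allocate real blocks (1 GB) so df actually reflects the usage. */
    int rc = posix_fallocate(fd, 0, 1024L * 1024 * 1024);
    if (rc != 0) { fprintf(stderr, "posix_fallocate: %d\n", rc); return EXIT_FAILURE; }

    /* After unlink(), "du" no longer sees the file, but "df" still counts its
     * blocks until the descriptor is closed; lsof lists it with a "(deleted)"
     * suffix, like the beam.smp entries above. */
    unlink(path);

    puts("unlinked; space is released only when the fd is closed");
    pause();     /* hold the descriptor open until interrupted */
    close(fd);
    return EXIT_SUCCESS;
}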

 Comments   
Comment by Volker Mische [ 22/Apr/14 ]
Filipe, could you have a look at this?

I also found in the bug tracker an issue that we needed to patch Erlang because of file descriptor leaks (CBD-753 [1]). Could it be related?

[1]: http://www.couchbase.com/issues/browse/CBD-753
Comment by Trond Norbye [ 22/Apr/14 ]
From the comments, that bug seems to have been a blocker for 2.1 testing, and this is 2.2...
Comment by Volker Mische [ 22/Apr/14 ]
Trond, Erlang was patched, so it's independent of the Couchbase version and depends only on the Erlang version. Though I guess you use Erlang >= R16 anyway (which should have that patch).

Could you also try it with 2.5? Perhaps it has been fixed already.
Comment by Filipe Manana [ 22/Apr/14 ]
There's no useful information here to work on or conclude anything.

First of all, it may be database files. Both database and view files are renamed to uuids and moved to the .delete directory. And before 3.0, database compaction is orchestrated in erlang land (rename + delete).

Second, we had such leaks in the past: one caused by Erlang itself (hence a patched R14/R15 is needed, or an unpatched R16 is fine) and others caused by CouchDB upstream code, which got fixed before 2.0 (and in Apache CouchDB 1.2). geocouch is based on a copy of CouchDB's view engine that is much older than Apache CouchDB 1.x and hence suffers from the same leak issues (not closing files after compactions, amongst other cases).

Given there are no concrete steps to reproduce this, and no one has observed it recently, I can't exclude the possibility of them using the geo/spatial views or running an unpatched Erlang.
Comment by Trond Norbye [ 22/Apr/14 ]
Volker: We ship a bundled erlang in our releases don't we?

Filipe: I forwarded the email to you, Volker and Alk April the 8th with all the information I had. We can ask back for more information if that helps us pinpoint where it is..
Comment by Volker Mische [ 22/Apr/14 ]
Trond: I wasn't aware that the "I" isn't you :)

I would ask them to try it again on 2.5.1 and, if it's still there, for the steps to reproduce it.
Comment by Trond Norbye [ 22/Apr/14 ]
The customer has the system running. Are we sure there are no commands we can run on the Erlang VM to gather more information about its current state?
Comment by Sriram Melkote [ 06/Jun/14 ]
Ketaki, can we please make sure in 3.0 tests:

(a) Number of open file descriptors does not keep growing
(b) The files in .delete directory get cleaned up eventually
(c) Disk space does not keep growing

If all three hold in long-running tests, we can close this for 3.0 (a quick per-node check is sketched below).
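A minimal sketch of such a check, assuming the data path used elsewhere in this ticket (/opt/couchbase/var/lib/couchbase/data) and run on each node:

# count open fds and deleted-but-open files per beam.smp instance
for pid in $(pidof beam.smp); do
    echo "beam.smp $pid: $(ls /proc/$pid/fd | wc -l) open fds, $(lsof -a -p $pid | grep -c '(deleted)') deleted-but-open files"
done
# contents and size of the .delete directory, plus overall disk usage
ls -l /opt/couchbase/var/lib/couchbase/data/.delete/
du -sh /opt/couchbase/var/lib/couchbase/data/.delete/
df -h /opt/couchbase/var/lib/couchbase/data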
Comment by Sriram Melkote [ 16/Jun/14 ]
I'm going to close this as we've not seen evidence of fd leaks in R16 so far. If system tests encounter fd leak, please reopen.
Comment by Sriram Admin [ 01/Jul/14 ]
Reopening as we're seeing it in another place (CBSE-1247), which makes it likely this is a product (and not environment) issue.
Comment by Sriram Melkote [ 03/Jul/14 ]
Nimish, can we:

(a) Use a specific naming pattern (e.g., view-{uuid} or similar) so we can distinguish KV files from view files after they move to .delete
(b) Add a log message that records the number of entries in, and the size of, the .delete directory


This will help us see if we're accumulating .delete files during our system testing.
Comment by Nimish Gupta [ 09/Jul/14 ]
Hi, it could be that the views are not closing the fds before deleting the files. I have added a debug message to log the filename when we call delete (http://review.couchbase.org/#/c/39233/). Ketaki, could you please try to reproduce the issue with the latest build?
Comment by Sriram Melkote [ 09/Jul/14 ]
Ketaki, can you please attach logs from a system test run with rebalance etc, with the above change merged? This will help us understand how to fix the problem better.
Comment by Ketaki Gangal [ 09/Jul/14 ]
Yes, will do. The above changes are a part of build 3.0.0-943-rel. With next run of system tests, I will update this bug.
Comment by Nimish Gupta [ 09/Jul/14 ]
Ketaki, please also attach the output of "ls -l" for the /var/opt/couchbase/.delete directory after running the test.
Comment by Ketaki Gangal [ 11/Jul/14 ]
- Run on build 3.0.0-943-rel.
- Attached information from the system cluster here

From all nodes:
1. ls -l from "/opt/couchbase/var/lib/couchbase/data/.delete/"*: it's zero all across
2. df
3. du
4. lsof from one of the nodes, and lsof of beam.smp below - don't see anything unusual
5. Collect info from all the nodes; however, a cursory grep for "Deleting couch file " did not yield any output.

* Also the disk usage/pattern seems consistent and expected on the current runs.

This cluster has had the following rebalance related operations
- Rebalance In 1
- Swap Rebalance
- Rebalance Out

Logs from the cluster https://s3.amazonaws.com/bugdb/MB-10921/10921.tar
Comment by Ketaki Gangal [ 11/Jul/14 ]
I don't see anything unusual in the output below, however --

[root@centos-64-x64 fd]# pidof beam.smp
1330 1192 1134

[root@centos-64-x64 fd]# lsof -a -p 1330
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
beam.smp 1330 couchbase cwd DIR 253,0 4096 529444 /opt/couchbase/var/lib/couchbase
beam.smp 1330 couchbase rtd DIR 253,0 4096 2 /
beam.smp 1330 couchbase txt REG 253,0 54496893 523964 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp
beam.smp 1330 couchbase mem REG 253,0 165929 525131 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto_callback.so
beam.smp 1330 couchbase mem REG 253,0 88600 392512 /lib64/libz.so.1.2.3
beam.smp 1330 couchbase mem REG 253,0 1408384 1183516 /usr/lib64/libcrypto.so.0.9.8e
beam.smp 1330 couchbase mem REG 253,0 476608 525130 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto.so
beam.smp 1330 couchbase mem REG 253,0 135896 392504 /lib64/libtinfo.so.5.7
beam.smp 1330 couchbase mem REG 253,0 1916568 392462 /lib64/libc-2.12.so
beam.smp 1330 couchbase mem REG 253,0 43832 392490 /lib64/librt-2.12.so
beam.smp 1330 couchbase mem REG 253,0 142464 392486 /lib64/libpthread-2.12.so
beam.smp 1330 couchbase mem REG 253,0 140096 392500 /lib64/libncurses.so.5.7
beam.smp 1330 couchbase mem REG 253,0 595688 392470 /lib64/libm-2.12.so
beam.smp 1330 couchbase mem REG 253,0 19536 392468 /lib64/libdl-2.12.so
beam.smp 1330 couchbase mem REG 253,0 14584 392494 /lib64/libutil-2.12.so
beam.smp 1330 couchbase mem REG 253,0 154464 392455 /lib64/ld-2.12.so
beam.smp 1330 couchbase 0r FIFO 0,8 0t0 11761 pipe
beam.smp 1330 couchbase 1w FIFO 0,8 0t0 11760 pipe
beam.smp 1330 couchbase 2w FIFO 0,8 0t0 11760 pipe
beam.smp 1330 couchbase 3u REG 0,9 0 3696 anon_inode
beam.smp 1330 couchbase 4r FIFO 0,8 0t0 11828 pipe
beam.smp 1330 couchbase 5w FIFO 0,8 0t0 11828 pipe
beam.smp 1330 couchbase 6r FIFO 0,8 0t0 11829 pipe
beam.smp 1330 couchbase 7w FIFO 0,8 0t0 11829 pipe
beam.smp 1330 couchbase 8w REG 253,0 8637 529519 /opt/couchbase/var/lib/couchbase/logs/ssl_proxy.log
beam.smp 1330 couchbase 9u IPv4 12253 0t0 TCP *:11214 (LISTEN)
beam.smp 1330 couchbase 10u IPv4 12255 0t0 TCP localhost:11215 (LISTEN)
Comment by Nimish Gupta [ 14/Jul/14 ]
Ketaki, could you please run the test for a longer duration (e.g. 2-3 days) and check the number of files in the .delete directory. Please also upload the logs if you see files in the .delete directory.
Moreover, we can add a check to testrunner to inspect the .delete directory after all the tests finish.
Comment by Ketaki Gangal [ 14/Jul/14 ]
Hi Nimish,

This test runs for 2-3 days and the logs are from this run.
From the logs/outputs attached above, I don't see any files in the /data/.delete directory.

Btw, the testrunner tests are much smaller, run a single rebalance per test, and have the clusters torn down afterwards; I am not sure it would be helpful to add a ".delete" check to them.
Please advise.

Comment by Sriram Melkote [ 14/Jul/14 ]
Hi Ketaki - yes, please add the check to the long running tests only. It could note the contents of the "delete" directory before stopping the cluster.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

Could be related to R14 and might go away. Added logging to investigate. Keeping it open to confirm from QE testing.
Comment by Ketaki Gangal [ 22/Jul/14 ]
Re-tested with build 3.0.0-973-rel. Not observing any files in .delete dir

Runtime : 72 hours+

Rebalance operations on this cluster include
1- Rebalance in 1 node
2. Swap Rebalance
3. Rebalance out 1 node
4. Rebalance in 2 nodes
5. Failover , add back

Logs attached include
1. Collect Info from the cluster https://s3.amazonaws.com/bugdb/MB-10921/10921-2.tar
2. .delete contents https://www.couchbase.com/issues/secure/attachment/21346/ls_delete.rtf
3. lsof beam https://www.couchbase.com/issues/secure/attachment/21347/lsof_beam.rtf

Comment by Sriram Melkote [ 22/Jul/14 ]
Nimish - it looks like Ketaki's runs show clean result after a long test, and so we have done due diligence and can assume (R16 probably) fixed the leak. If you agree, please close the bug. Thanks for your help Ketaki.
Comment by Ketaki Gangal [ 22/Jul/14 ]
Seeing some entries in the .delete directory on the Jenkins run suite -- /opt/couchbase/var/lib/couchbase/data/.delete ..

Testrunner suite which runs into this is http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/93/consoleFull


From one of the nodes : 172.23.106.186

root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data# cd .delete/
root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data/.delete# ls -alth
total 32K
drwxrwx--- 6 couchbase couchbase 4.0K Jul 22 13:32 ..
drwxrwx--- 8 couchbase couchbase 4.0K Jul 22 13:24 .
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 13:09 315f1070a8e2e6413c8bf8177aa75f48
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 10:23 a659f4c60beb398c38b7a2563694f5fe
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 09:53 a8d29f6762f20ff56f6c542b19787d88
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 09:35 5f6ed028ea9afb7f7a1a09ae45fc3579
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 04:11 b8cd56417d94eba728a2a21e27c487b6
drwxrwx--- 2 couchbase couchbase 4.0K Jul 22 02:39 d47537360387b3fc6ba8d740acd61d34

root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data# du -h .delete/
4.0K .delete/315f1070a8e2e6413c8bf8177aa75f48
4.0K .delete/a8d29f6762f20ff56f6c542b19787d88
4.0K .delete/b8cd56417d94eba728a2a21e27c487b6
4.0K .delete/a659f4c60beb398c38b7a2563694f5fe
4.0K .delete/d47537360387b3fc6ba8d740acd61d34
4.0K .delete/5f6ed028ea9afb7f7a1a09ae45fc3579
28K .delete/

root@cherimoya-s21611:/opt/couchbase/var/lib/couchbase/data# lsof -a -p 9355
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
beam.smp 9355 couchbase cwd DIR 252,0 4096 136610 /opt/couchbase/var/lib/couchbase
beam.smp 9355 couchbase rtd DIR 252,0 4096 2 /
beam.smp 9355 couchbase txt REG 252,0 52200651 135515 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp
beam.smp 9355 couchbase mem REG 252,0 182169 135218 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto_callback.so
beam.smp 9355 couchbase mem REG 252,0 92720 1177572 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
beam.smp 9355 couchbase mem REG 252,0 1612544 1183307 /lib/x86_64-linux-gnu/libcrypto.so.0.9.8
beam.smp 9355 couchbase mem REG 252,0 495975 135217 /opt/couchbase/lib/erlang/lib/crypto-3.2/priv/lib/crypto.so
beam.smp 9355 couchbase mem REG 252,0 159200 1177451 /lib/x86_64-linux-gnu/libtinfo.so.5.9
beam.smp 9355 couchbase mem REG 252,0 1811128 1177440 /lib/x86_64-linux-gnu/libc-2.15.so
beam.smp 9355 couchbase mem REG 252,0 31752 1177509 /lib/x86_64-linux-gnu/librt-2.15.so
beam.smp 9355 couchbase mem REG 252,0 135366 1177444 /lib/x86_64-linux-gnu/libpthread-2.15.so
beam.smp 9355 couchbase mem REG 252,0 133808 1177407 /lib/x86_64-linux-gnu/libncurses.so.5.9
beam.smp 9355 couchbase mem REG 252,0 1030512 1182376 /lib/x86_64-linux-gnu/libm-2.15.so
beam.smp 9355 couchbase mem REG 252,0 14768 1177438 /lib/x86_64-linux-gnu/libdl-2.15.so
beam.smp 9355 couchbase mem REG 252,0 10632 1182384 /lib/x86_64-linux-gnu/libutil-2.15.so
beam.smp 9355 couchbase mem REG 252,0 149280 1182383 /lib/x86_64-linux-gnu/ld-2.15.so
beam.smp 9355 couchbase 0r FIFO 0,8 0t0 706505 pipe
beam.smp 9355 couchbase 1w FIFO 0,8 0t0 706504 pipe
beam.smp 9355 couchbase 2w FIFO 0,8 0t0 706504 pipe
beam.smp 9355 couchbase 3u 0000 0,9 0 7808 anon_inode
beam.smp 9355 couchbase 4r FIFO 0,8 0t0 705201 pipe
beam.smp 9355 couchbase 5w FIFO 0,8 0t0 705201 pipe
beam.smp 9355 couchbase 6r FIFO 0,8 0t0 705202 pipe
beam.smp 9355 couchbase 7w FIFO 0,8 0t0 705202 pipe
beam.smp 9355 couchbase 8w REG 252,0 4320 139124 /opt/couchbase/var/lib/couchbase/logs/ssl_proxy.log
beam.smp 9355 couchbase 9u IPv4 702269 0t0 TCP *:11214 (LISTEN)
beam.smp 9355 couchbase 10u IPv4 705207 0t0 TCP localhost:11215 (LISTEN)


Logs https://s3.amazonaws.com/bugdb/MB-10921/10921-3.tar
-- note this is from a set of tests and not a single test in itself. I am not currently certain of how reproducible this is. But I am seeing this across a couple of machines which are failing due to view-queries taking longer time to run.
Comment by Nimish Gupta [ 30/Jul/14 ]
These are directories inside the .delete directory, not files. These directories also look to be empty. The lsof output likewise shows no deleted files still held open. So this is not a critical case. I suggest we close this JIRA and reopen it if we see files in the .delete directories.
Comment by Anil Kumar [ 30/Jul/14 ]
Triage : Anil, Wayne .. July 30th

Trond/ Nimish - Please update the ticket. Let us know if you're planning to fix it by 3.0
Comment by Sriram Melkote [ 01/Aug/14 ]
Closing as not reproducible, since we have not seen a leaked file so far. If we do find one, please reopen this bug.




[MB-11849] couch_view_index_updater crashes (Segmentation fault) during test with stale=false queries Created: 29/Jul/14  Updated: 01/Aug/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1045

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = 2 x SSD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/leto/407/artifact/
Is this a Regression?: Yes

 Description   
Backtraces are attached.

There are many couchdb errors in logs as well.

 Comments   
Comment by Sriram Melkote [ 29/Jul/14 ]
Nimish, it appears like it may be the exit_thread_helper problem again in a different part of the code?
Comment by Pavel Paulau [ 30/Jul/14 ]
I tried a more recent build (1057); the segfault didn't happen (logs: http://ci.sc.couchbase.com/job/leto/412/artifact/).
Most likely the issue is intermittent.
Comment by Nimish Gupta [ 30/Jul/14 ]
Hi, this is not a problem with exit_thread_helper. It looks like the updater is not able to open a sort file, and there is no information in the logs about that sort file. Without a core dump, I don't think it is possible to know the exact reason the open failed.

Because the open failed, the error path was executed, and a minor bug there caused the crash. I have fixed that minor bug and the code is in review (http://review.couchbase.org/40052).

Pavel, could you please attach the core dump of couch_view_index_updater?


Comment by Pavel Paulau [ 30/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11849/core.couch_view_inde.27157.leto-s303.1406654330
Comment by Nimish Gupta [ 31/Jul/14 ]
I was not able to get a backtrace from the core because the core file did not match the executable. I tried spinning up a new machine and installing the build, but nothing worked.
Pavel, could you please try to reproduce the issue with the latest build and give me access to the machine after you reproduce it.
Comment by Pavel Paulau [ 31/Jul/14 ]
What package did you use?
Comment by Nimish Gupta [ 31/Jul/14 ]
http://latestbuilds.hq.couchbase.com/couchbase-server-enterprise_centos6_x86_64_3.0.0-1045-rel.rpm
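One way to line the core up with the exact binary that produced it is to unpack that same rpm next to the core and point gdb at the extracted updater. A hedged sketch (the rpm URL and core filename are the ones quoted above; the install path of couch_view_index_updater inside the package is located with find rather than assumed):

$ wget http://latestbuilds.hq.couchbase.com/couchbase-server-enterprise_centos6_x86_64_3.0.0-1045-rel.rpm
$ rpm2cpio couchbase-server-enterprise_centos6_x86_64_3.0.0-1045-rel.rpm | cpio -idm
$ gdb -ex 'bt full' -ex quit $(find . -name couch_view_index_updater) core.couch_view_inde.27157.leto-s303.1406654330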




[MB-11755] vbucket-seqno stats requests getting timed out during view queries Created: 17/Jul/14  Updated: 01/Aug/14  Resolved: 01/Aug/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Meenakshi Goel
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-966-rel

Issue Links:
Relates to
relates to MB-11706 Graceful failover gets to 55% then hangs Resolved
Triage: Triaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Test to reproduce:
view.viewquerytests.ViewQueryTests.test_employee_dataset_query_different_buckets,docs-per-day=500,sasl_buckets=4,standard_buckets=4

Note: This issue is opened to track the performance impact on view queries due to these timeouts.

Seeing lot of timeout errors in logs:
[couchdb:error,2014-07-16T11:37:13.190,ns_1@172.23.107.20:<0.7628.0>:couch_log:error:42]upr client (<0.7647.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-16T11:37:13.190,ns_1@172.23.107.20:<0.7532.0>:couch_log:error:42]upr client (<0.7551.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-16T11:37:13.190,ns_1@172.23.107.20:<0.6418.0>:couch_log:error:42]upr client (<0.6428.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-16T11:37:13.196,ns_1@172.23.107.20:<0.17371.0>:couch_log:error:42]upr client (<0.17390.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-16T11:37:13.196,ns_1@172.23.107.20:<0.17053.0>:couch_log:error:42]upr client (<0.17071.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
[couchdb:error,2014-07-16T11:37:13.196,ns_1@172.23.107.20:<0.14591.0>:couch_log:error:42]upr client (<0.14610.0>): vbucket-seqno stats timed out after 2.0 seconds. Waiting...
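When triaging these timeouts, it can help to time the same stats request from the shell to see whether the server side is genuinely slow to answer. A rough sketch, assuming this build's cbstats exposes the vbucket-seqno group and the bucket is named default:

$ time /opt/couchbase/bin/cbstats 172.23.107.20:11210 vbucket-seqno -b default | head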

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17

Keeping this for tracking performance. Related to MB-11706.
Comment by Ketaki Gangal [ 21/Jul/14 ]
Seeing identical timeouts on the following test

./testrunner -i /tmp/rebal.ini active_resident_threshold=100,dgm_run=true,get-delays=True,get-cbcollect-info=True,eviction_policy=fullEviction,max_verify=100000 -t rebalance.rebalancein.RebalanceInTests.rebalance_in_with_ddoc_compaction,items=500000,max_verify=100000,GROUP=IN;BASIC;COMPACTION;P0;FROM_2_0

Steps
- Create cluster, bucket
- Load items
- Query the views --- started seeing above timeouts.

Details : https://github.com/couchbase/testrunner/blob/master/pytests/rebalance/rebalancein.py#L363
Jenkins job :http://qa.sc.couchbase.com/job/centos_x64-00_02-tunable-rebalance-P0/64/console
Comment by Sriram Melkote [ 01/Aug/14 ]
Marking as fixed because we now issue no more than 4 STATS requests per second (MB-11728).




[MB-11187] V8 crashes on memory allocation errors, closes erlang on some indexing loads Created: 22/May/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.2.0, 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Brent Woodruff Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: hotfix
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
On some indexing workloads, V8 can experience issues allocating memory. This, in turn, will cause Erlang to close resulting in the node becoming pending while the babysitter restarts the main Erlang VM.

This can occur even when there is sufficient memory available on the node. The node does not need to experience out-of-memory for this to happen.

To diagnose if this is occurring, it's possible to check logs for a few messages.

babysitter log:

====
[ns_server:info,2014-05-16T20:46:44.417,babysitter_of_ns_1@127.0.0.1:<0.71.0>:ns_port_server:log:168]ns_server<0.71.0>:
ns_server<0.71.0>: #
ns_server<0.71.0>: # Fatal error in CALL_AND_RETRY_2
ns_server<0.71.0>: # Allocation failed - process out of memory
ns_server<0.71.0>: #
ns_server<0.71.0>:

[ns_server:info,2014-05-16T20:46:44.744,babysitter_of_ns_1@127.0.0.1:<0.71.0>:ns_port_server:log:168]ns_server<0.71.0>: /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.Erlang has closed
ns_server<0.71.0>:

[ns_server:info,2014-05-16T20:46:44.745,babysitter_of_ns_1@127.0.0.1:<0.70.0>:supervisor_cushion:handle_info:58]Cushion managed supervisor for ns_server failed: {abnormal,134}
[error_logger:error,2014-05-16T20:46:44.745,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.71.0> terminating
** Last message in was {#Port<0.2943>,{exit_status,134}}
** When Server state == {state,#Port<0.2943>,ns_server,
                               {[[],
                                 "/opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.Erlang has closed ",
                                 [],"#",
                                 "# Allocation failed - process out of memory",
                                 "# Fatal error in CALL_AND_RETRY_2","#",[],
                                 "working as port","working as port",
                                 "Apache CouchDB has started. Time to relax.",
                                 "Apache CouchDB 1.2.0a-01dda76-git (LogLevel=info) is starting.",
                                 empty],
                                [empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty,empty,empty,empty,empty,
                                 empty,empty,empty]},
                               {ok,{1400273204926,#Ref<0.0.0.18660>}},
                               [[],
                                "/opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.Erlang has closed "],
                               0,true}
** Reason for termination ==
** {abnormal,134}
====

One can check for the latest occurrence of this error (or for all occurrences by removing the tail -1) with this command:

$ /opt/couchbase/bin/cbbrowse_logs babysitter | awk '/\]ns_server<.*>: $/,/# Allocation failed - process out of memory/' | grep -E -o '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]*:[0-9]{2}:[0-9]{2}' | tail -1
2014-05-16T20:46:44


 Comments   
Comment by Sarath Lakshman [ 23/May/14 ]
Backported changes for v8 version upgrade.

Following changes are under review:
http://review.couchbase.org/#/c/37506/
http://review.couchbase.org/#/c/37507/
Comment by Sarath Lakshman [ 23/May/14 ]
Merged back ported changes
Comment by Sarath Lakshman [ 23/May/14 ]
I have sent a mail to build team for providing the build with the following changes:

couchdb commit (https://github.com/couchbase/couchdb/commit/505c2278b34eb7a47843d5017d101e98aa856d6a)

v8 version change (http://review.couchbase.org/#/c/32781/2/override-3.0.0.xml)
Comment by Sriram Melkote [ 29/May/14 ]
Wayne - it's good to have some QE coverage for this as it's shipping to a production customer. Can Ketaki or Meenakshi please validate and also do some additional sanity testing on this build? Thanks!
Comment by Sarath Lakshman [ 29/May/14 ]
Brent, Could you help us verify this patch using the reproducible setup you have ?
Comment by Sarath Lakshman [ 29/May/14 ]
No. We should wait for the QE team to certify this fix before rolling it out to the customer.
Comment by Wayne Siu [ 29/May/14 ]
Brent,
We still need to run the regression tests on the hotfix. I'll update the ticket with an ETA on Monday.
In the meantime, if the customer could help verify the hotfix, that would be fine, with the understanding that regression tests are still in progress.
Comment by Wayne Siu [ 09/Jun/14 ]
Brent,
We can provide the ubuntu 10.4 package. We'll run a quick sanity on the binary, and update the ticket here later. Will shoot for later today.
Comment by Wayne Siu [ 12/Jun/14 ]
The 10.4 package passed the sanity tests.
Comment by Wayne Siu [ 20/Jun/14 ]
Brent,
Please let us know if we could close this ticket.
Comment by Brent Woodruff [ 20/Jun/14 ]
I believe it would be ok to close this MB. The backporting work has been completed, the builds required have been made and tested, and the updated files were provided.
Comment by Wayne Siu [ 31/Jul/14 ]
Brent,
Let us know if there is any open item at this time.




[MB-11815] Support Ubuntu 14.04 as supported platform Created: 24/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Anil Kumar Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to

 Description   
We need to add support for Ubuntu 14.04.




[MB-11779] Memory underflow in updates-only scenario with 5 buckets Created: 21/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/view/lab/job/perf-dev/503/artifact/
Is this a Regression?: Yes

 Description   
Essentially re-opened MB-11661.

2 nodes, 5 buckets, 200K x 1KB docs per bucket (non-DGM), 2K updates per bucket.

Mon Jul 21 13:24:34.955935 PDT 3: (bucket-1) Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...

 Comments   
Comment by Sriram Ganesan [ 22/Jul/14 ]
Pavel

How often would you say this reproduces in your environment? I tried this locally a few times and didn't hit this.
Comment by Pavel Paulau [ 23/Jul/14 ]
Pretty much every time.

It usually takes >10 hours before the test hits the GIGANTOR failure, but the slowly decreasing mem_used indicates the issue well before that.
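A hypothetical way to watch for the slow mem_used decline without waiting for the GIGANTOR message (assumes cbstats is available on the node and the bucket names from this test; run per bucket):

while true; do
    echo -n "$(date +%T) "
    /opt/couchbase/bin/cbstats localhost:11210 all -b bucket-1 | grep -w mem_used
    sleep 60
done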
Comment by Pavel Paulau [ 26/Jul/14 ]
Just spotted again in different scenario, build 3.0.0-1024. Proof: https://s3.amazonaws.com/bugdb/jira/MB-11779/172.23.96.11.zip .
Comment by Sriram Ganesan [ 28/Jul/14 ]
Pavel

Thanks for uploading those logs. I see a bunch of vbucket deletion messages in the test

Fri Jul 25 07:33:16.745484 PDT 3: (bucket-10) Deletion of vbucket 1023 was completed.
Fri Jul 25 07:33:16.745619 PDT 3: (bucket-10) Deletion of vbucket 1022 was completed.
Fri Jul 25 07:33:16.745739 PDT 3: (bucket-10) Deletion of vbucket 1021 was completed.
Fri Jul 25 07:33:16.745887 PDT 3: (bucket-10) Deletion of vbucket 1020 was completed.
Fri Jul 25 07:33:16.746005 PDT 3: (bucket-10) Deletion of vbucket 1019 was completed.
Fri Jul 25 07:33:16.746177 PDT 3: (bucket-10) Deletion of vbucket 1018 was completed.

This seems to be the case for all the buckets, but the GIGANTOR message only shows up for 5 of them. Are these logs from the same test? Are you doing any forced shutdown of any of the buckets in your test? According to Chiyoung, there is a known issue in ep-engine at bucket shutdown time where the GIGANTOR message can appear, affecting only the bucket being shut down.
Comment by Sriram Ganesan [ 28/Jul/14 ]
Also, please confirm if any rebalance operations were done in the logs uploaded on the 25th.
Comment by Pavel Paulau [ 28/Jul/14 ]
Sriram,

Logs are from different test/setup (with 10 buckets).

There was only one rebalance event during initial cluster setup:

2014-07-25 07:33:07.970 ns_orchestrator:4:info:message(ns_1@172.23.96.11) - Starting rebalance, KeepNodes = ['ns_1@172.23.96.11','ns_1@172.23.96.12',
                                 'ns_1@172.23.96.13','ns_1@172.23.96.14'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
2014-07-25 07:33:07.995 ns_orchestrator:1:info:message(ns_1@172.23.96.11) - Rebalance completed successfully.

10 buckets were created after that:

2014-07-25 07:33:13.674 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:13.784 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:14.005 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.005 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.006 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:14.031 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:14.082 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:14.384 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.14' in 1 seconds.
2014-07-25 07:33:14.384 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.385 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.588 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.588 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:14.682 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.11' in 1 seconds.
2014-07-25 07:33:15.107 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-1" loaded on node 'ns_1@172.23.96.12' in 1 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-2" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.110 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.111 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.111 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.218 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.219 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.303 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.303 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.304 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-3" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.305 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.312 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-7" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-6" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-4" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-5" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.313 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-9" loaded on node 'ns_1@172.23.96.11' in 0 seconds.
2014-07-25 07:33:15.610 ns_memcached:0:info:message(ns_1@172.23.96.12) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.12' in 0 seconds.
2014-07-25 07:33:15.716 ns_memcached:0:info:message(ns_1@172.23.96.14) - Bucket "bucket-10" loaded on node 'ns_1@172.23.96.14' in 0 seconds.
2014-07-25 07:33:15.802 ns_memcached:0:info:message(ns_1@172.23.96.13) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.13' in 0 seconds.
2014-07-25 07:33:15.811 ns_memcached:0:info:message(ns_1@172.23.96.11) - Bucket "bucket-8" loaded on node 'ns_1@172.23.96.11' in 0 seconds.

Basically bucket shutdown wasn't forced. All those operations are quite normal.

Also, from the logs I can see the underflow issue only in "bucket-10".
Comment by Pavel Paulau [ 30/Jul/14 ]
Hi Sriram,

I can start the test which will reproduce the issue. Will live cluster help?
Comment by Sriram Ganesan [ 30/Jul/14 ]
Pavel

I was planning on providing a toy build today, but I need to do more local testing in my environment before I can provide it. The current theory is that the root cause actually happens much earlier, messing up the accounting and eventually leading to an underflow. I shall try to get the build out before noon today.
Comment by Sriram Ganesan [ 30/Jul/14 ]
Pavel

Please try the following toy build: http://latestbuilds.hq.couchbase.com/couchbase-server-community_cent58-3.0.0-toy-sriram-x86_64_3.0.0-712-toy.rpm. The memcached process will crash once it hits a particular condition which could be the manifestation of the bug. You will likely hit that right after the updates start. Also, please run the test on a regular build but with just one node, to see if we hit this problem there. If you don't hit it, we can rule out other areas, and DCP/UPR becomes the more likely culprit.
Comment by Pavel Paulau [ 31/Jul/14 ]
It does crash almost immediately, backtraces are attached.

Logs:
http://ci.sc.couchbase.com/job/perf-dev/540/artifact/172.23.100.17.zip
http://ci.sc.couchbase.com/job/perf-dev/540/artifact/172.23.100.18.zip
Comment by Pavel Paulau [ 31/Jul/14 ]
Memory usage seems stable in the single-node setup. I made only one run though.
Comment by Sriram Ganesan [ 31/Jul/14 ]
Thanks Pavel. The crash probably establishes that an allocation for an object is accounted to one bucket (or to no bucket, if it happens in the memcached layer) while the deallocation is accounted to a different bucket. The fact that the single-node setup is quite stable points to DCP as the more likely culprit here.
Comment by Sriram Ganesan [ 31/Jul/14 ]
Pavel

Would it be possible to get access to your environment, along with instructions to run the test? I looked through the DCP code for hints but couldn't make any plausible guesses. Debugging in your environment, where I can keep adding logs, would be more helpful.

Thanks
Sriram




[MB-11675] 40-50% performance degradation on append-heavy workload compared to 2.5.1 Created: 09/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: OS X Mavericks 10.9.3
CB server 3.0.0-918 (http://packages.northscale.com/latestbuilds/3.0.0/couchbase-server-enterprise_x86_64_3.0.0-918-rel.zip)
Haswell MacBook Pro (16GB RAM)

Attachments: PNG File CB 2.5.1 revAB_sim.png     PNG File CB 3.0.0-918 revAB_sim.png     JPEG File ep.251.jpg     JPEG File ep.300.jpg     JPEG File epso.251.jpg     JPEG File epso.300.jpg     Zip Archive MB-11675.trace.zip     Zip Archive perf_report_result.zip     Zip Archive revAB_sim_v2.zip     Zip Archive revAB_sim.zip    
Issue Links:
Relates to
relates to MB-11642 Intra-replication falling far behind ... Closed
relates to MB-11623 test for performance regressions with... In Progress

 Description   
When running an append-heavy workload (modelling a social network address book, see below) the performance of CB has dropped from ~100K ops down to 50K ops compared to 2.5.1-1083 on OS X.

Edit: I see a similar (but slightly smaller, around 40%) degradation on Linux (Ubuntu 14.04) - see comment below for details.

== Workload ==

revAB_sim - generates a model social network, then builds a representation of this in Couchbase. Keys are a set of phone numbers, values are lists of phone books which contain that phone number. (See attachment).

Configured for 8 client threads, 100,000 people (documents).

To run:

* pip install networkx
* Check revAB_sim.py for correct host, port, etc
* time ./revAB_sim.py

== Cluster ==

1 node, default bucket set to 1024MB quota.

== Runtimes for workload to complete ==


## CB-2.5.1-1083:

~107K op/s. Timings for workload (3 samples):

real 2m28.536s
real 2m28.820s
real 2m31.586s


## CB-3.0.0-918

~54K op/s. Timings for workload:

real 5m23.728s
real 5m22.129s
real 5m24.947s


 Comments   
Comment by Pavel Paulau [ 09/Jul/14 ]
I'm just curious, what does consume all CPU resources?
Comment by Dave Rigby [ 09/Jul/14 ]
I haven't had chance to profile it yet; certainly in both instances (fast / slow) the CPU is at 100% between the client workload and server.
Comment by Pavel Paulau [ 09/Jul/14 ]
Is memcached top consumer? or beam.smp? or client?
Comment by Dave Rigby [ 09/Jul/14 ]
memcached highest (as expected). From the 3.0.0 package (which I still have installed):

PID COMMAND %CPU TIME #TH #WQ #PORT #MREG MEM RPRVT PURG CMPRS VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW
34046 memcached 476.9 01:34.84 17/7 0 36 419 278M+ 277M+ 0B 0B 348M 2742M 34046 33801 running 501 73397+ 160 67 26 13304643+ 879+ 4070244+
34326 Python 93.4 00:18.57 9/1 0 25 418 293M+ 293M+ 0B 0B 386M 2755M 34326 1366 running 501 77745+ 399 70 28 15441263+ 629 5754198+
0 kernel_task 71.8 00:14.29 95/9 0 2 949 1174M+ 30M 0B 0B 295M 15G 0 0 running 0 42409 0 57335763+ 52435352+ 0 0 278127194+
...
32800 beam.smp 8.5 00:05.61 30/4 0 49 330 155M- 152M- 0B 0B 345M- 2748M- 32800 32793 running 501 255057+ 468 149 30 6824071+ 1862753+ 1623911+


Python is the workload generator.

I shall try to collect an Instruments profile of 3.0 and 2.5.1 to compare...
Comment by Dave Rigby [ 09/Jul/14 ]
Instruments profile of two runs:

Run 1: 3.0.0 (slow)
Run 2: 2.5.1 (fast)

I can look into the differences tomorrow if no-one else gets there first.


Comment by Dave Rigby [ 10/Jul/14 ]
Running on Linux (Ubuntu 14.04), 24 core Xeon, I see a similar effect, but the magnitude is not as bad - 40% performance drop.

100,000 documents with 4 worker threads, same bucket size (1024MB). (Note: worker threads was dropped to 4 as I couldn't get Python SDK to reliably connect with 8 threads at the same time).

## CB-3.0.0 (source build):

    83k op/s
    real 3m26.785s

## CB-2.5.1 (source build):

    133K op/s
    real 2m4.276s


Edit: Attached updated zip file as: revAB_sim_v2.zip
Comment by Dave Rigby [ 10/Jul/14 ]
Attaching the output of `perf report` for both 2.5.1 and 3.0.0 - perf_report_result.zip

There's nothing obvious jumping out at me, looks like quite a bit has changed between the two in ep_engine.
Comment by Dave Rigby [ 11/Jul/14 ]
I'm tempted to bump this to "blocker" considering it also affects Linux - any thoughts?
Comment by Pavel Paulau [ 11/Jul/14 ]
It's a product/release blocker, no doubt.

(though raising priority at this point will not move ticket to the top of the backlog due to other issues)
Comment by Dave Rigby [ 11/Jul/14 ]
@Pavel done :)
Comment by Abhinav Dangeti [ 11/Jul/14 ]
I think I should bring to people's notice that in 3.0 JSON detection has been moved to before items are set in memory. This could very well be the cause of this regression (previously we also did this JSON check, but only just before persistence).
This was part of the datatype-related change now required by UPR.
A HELLO protocol command was introduced in 3.0, which clients can invoke, thereby letting the server know that the client will set the datatype itself, in which case this JSON check doesn't take place.
If a client doesn't invoke the HELLO command, then we do JSON detection to set the datatype correctly.

However, HELLO was recently disabled because we weren't ready to handle compressed documents in the view engine. This means we now do a mandatory JSON check for every store operation, before setting the document even in memory.
Comment by Cihan Biyikoglu [ 11/Jul/14 ]
Thanks Abhinav. Can we quickly check whether this explains the issue and, if it is proven, revert this change?
Comment by David Liao [ 14/Jul/14 ]
I tried testing using the provided scripts with and without the json checking logic and there is no difference (on Mac and Ubuntu).

The total size of data is less than 200 MB with 100K items, it's about <2K per item which is not very big.
Comment by David Liao [ 15/Jul/14 ]
There might be an issue with general disk operations. I tested plain sets and they show the same performance difference as appends.
Pavel, have you seen any 'set' performance drop with 3.0? There is no rebalance involved, just a single node in this test.
Comment by Pavel Paulau [ 16/Jul/14 ]
3.0 performs worse in CPU-bound scenarios.
However, Dave observed the same issue on a system with 24 vCPUs, which is kind of confusing to me.
Comment by Pavel Paulau [ 16/Jul/14 ]
Meanwhile I tried that script in my environment. I see no difference between 2.5.1 and 3.0.

3.0.0-969: real 3m30.530s
2.5.1-1083: real 3m28.911s

Peak throughput is about 80K in both cases.

h/w configuration:

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

I used a standalone server as test client and regular packages.
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: I was essentially maxing out the system, so that probably explains why even with 24 cores I could see the issue.
Comment by Pavel Paulau [ 16/Jul/14 ]
Does it mean that 83/133K ops/sec saturate system with 24 cores?
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: yes (including the client workload generator which was running on the same machine). I could possibly push it higher by increasing the client worker threads, but as mentioned I had some python SDK connection issues then.

Comment by Pavel Paulau [ 16/Jul/14 ]
Weird, in my case CPU utilization was less than 500% (IRIX mode).
Comment by David Liao [ 16/Jul/14 ]
I am using a 4-core/4 GB ubuntu VM for the test.

3.0
real 11m16.530s
user 2m33.814s
sys 2m35.779s
<30k ops

2.5.1
real 7m6.843s
user 2m6.693s
sys 2m2.696s
40k ops


During today's test, I found that the disk queue fill/drain rate on 3.0.0 is much smaller than on 2.5.1 (<2k vs 30k). The CPU usage is ~8% higher too, but most of the increase is in system CPU usage (total CPU is almost maxed out on 3.0).

Pavel, can you check the disk queue fill/drain rate of your test and system vs user cpu usage?
Comment by Pavel Paulau [ 16/Jul/14 ]
David,

I will check the disk stats tomorrow. In the meantime I would recommend running the benchmark with persistence disabled.
Comment by Pavel Paulau [ 18/Jul/14 ]
In my case drain rate is higher in 2.5.1 (80K vs. 5K) but size of write queue and rate of actual disk creates/updates is pretty much the same.

CPU utilization is 2x higher in 3.0 (400% vs. 200%).

However I don't understand how this information helps.
Comment by David Liao [ 21/Jul/14 ]
The drain rate may not be accurate on 2.5.1.
 
'iostat' shows about 2x 'tps' and 'KB_wrtn/s' for 3.0.0 vs 2.5.1, so it indicates far more disk activity in 3.0.0.

We need to find out what the extra disk activity is. Since ep-engine issues sets to couchstore, which then writes to disk, we should
benchmark couchstore separately to isolate the problem.

Pavel, is there a way to do a couchstore performance test?
Comment by Pavel Paulau [ 22/Jul/14 ]
Due to the increased number of flusher threads, 3.0.0 persists data faster; that must explain the higher disk activity.

Once again, disabling disk persistence entirely would eliminate the "disk" factor (just as an experiment).

Also, I don't think we made any critical changes in couchstore, so I don't expect a regression there. Chiyoung may have some benchmarks.
Comment by David Liao [ 22/Jul/14 ]
I have played with different flusher thread counts but don't see any improvement in my own not-designed-for-serious-performance-testing environment.

Logically, if the flusher threads run faster, the same total amount of data should be persisted in a shorter time. What I observe instead is that the higher TPS lasts for the entire test, which itself runs much longer than on 2.5.1; in other words, the total disk activity (TPS and data written to disk) for the same workload is much higher.

Do you mean using a memcached bucket when you say "disabling disk"? That test shows much less performance degradation, which means the majority of the problem is not in the memcached layer.

I am not familiar with the couchstore changes, but there are indeed quite a lot of them and I'm not sure who is responsible for that component. Still, it needs to be tested just like any other component.
Comment by Pavel Paulau [ 23/Jul/14 ]
I meant disabling persistence to disk in the couchbase bucket, e.g. using cbepctl (see the sketch below).
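For reference, a sketch of doing that with cbepctl (run per node and per bucket; the bucket name here is a placeholder):

$ /opt/couchbase/bin/cbepctl localhost:11210 stop -b default    # stop the flusher, i.e. disable persistence
$ /opt/couchbase/bin/cbepctl localhost:11210 start -b default   # re-enable persistence afterwards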
Comment by David Liao [ 23/Jul/14 ]
I disabled persistence with cbepctl and reran the tests and got the same performance degradation:

3.0.0:
real 6m3.988s
user 1m59.670s
sys 2m1.363s
ops: 50k

2.5.1
real 4m18.072s
user 1m45.940s
sys 1m39.775s
ops: 70k

So it's not the disk related operations that caused this.
Comment by David Liao [ 24/Jul/14 ]
Dave, what profiling tool did you use to collect the profiling data you attached?
Comment by Dave Rigby [ 24/Jul/14 ]
I used Linux perf - see for example http://www.brendangregg.com/perf.html
Comment by David Liao [ 25/Jul/14 ]
attach perf report for ep.so 3.0.0
Comment by David Liao [ 25/Jul/14 ]
perf report ep.so 2.51
Comment by David Liao [ 25/Jul/14 ]
I attached the memcached and ep.so CPU usage for both 3.0.0 and 2.5.1.

2.5.1 didn't use C++ atomics. I tested 3.0.0 without C++ atomics and saw roughly a 20% improvement.

Both with persistence disabled.

2.5.1
real 7m38.581s
user 2m11.771s
sys 2m27.968s
ops: 35k+

3.0.0
real 9m15.638s
user 2m31.642s
sys 2m56.154s
ops: ~30k

There are multiple things we still need to look at: the threading change in 3.0.0 (and figuring out the best number of threads for different workloads), and also why so much more data is being written to disk for this workload.

I am using my laptop for this perf testing, but this kind of test should be done in a dedicated/controlled testing environment.
So the perf team should test the following areas:
1. The C++ atomics change.
2. Different threading configurations for different types of workload.
3. Independent couchstore testing, decoupled from ep-engine.

Comment by Pavel Paulau [ 26/Jul/14 ]
As I mentioned before, I don't see a difference between 2.5.1 and 3.0.0 when using a dedicated/controlled testing environment.

Anyways, thanks for your deep investigation. I will try to reproduce the issue on my laptop.

cc Thomas
Comment by Thomas Anderson [ 29/Jul/14 ]
In both cases (append-heavy workloads where sets are > 50K ops), the performance degradation is seen in early 3.0.0 builds. Collateral symptoms were: 1) an increase in bytes written to store/disk, approximately 20%; 2) increased frequency of bucket compression (the log shows the bucket ranges being compressed overlap); 3) a drop-off of OPS over time.
Starting with build 3.0.0-1037, these performance metrics are generally aligned with/equivalent to 2.5.1: 1) the frequency of bucket compression is reduced; 2) the expansion of bytes written is reduced to almost 1-1; 3) the OPS contention/slowdown does not occur.

The test is 10 concurrent loaders with 1024-byte documents (JSON or non-JSON), averaging ~80K OPS.
Comment by Dave Rigby [ 30/Jul/14 ]
TL;DR: 3.0 release debs appear to be built *without* optimisation (!)

On a hunch I thought I'd see how we are building 3.0.0, as it seemed a little surprising we saw symbols for C++ atomics as I would have expected them to be inlined. Looking at the build log [1], I see we are building the .deb package as Debug, without optimisation:

    (cd build && cmake -G "Unix Makefiles" -D CMAKE_INSTALL_PREFIX="/opt/couchbase" -D CMAKE_PREFIX_PATH=";/opt/couchbase" -D PRODUCT_VERSION=3.0.0-1059-rel -D BUILD_ENTERPRISE=TRUE -D CMAKE_BUILD_TYPE=Debug -D CB_DOWNLOAD_DEPS=1 ..)

Note: CMAKE_BUILD_TYPE=****Debug****

From my local Ubuntu build, I see that CXX flags are set to the following for each of Debug / Release / RelWithDebInfo:

    CMAKE_CXX_FLAGS_DEBUG:STRING=-g
    CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
    CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG

For comparison I checked the latest 2.5.1 build [2] (which may not be the same as the last 2.5.1 release) and I see we *did* compile that with -O3 - for example:

    libtool: compile: g++ -DHAVE_CONFIG_H -I. -I./src -pipe -I./include -DHAVE_VISIBILITY=1 -fvisibility=hidden -I./src -I./include -I/opt/couchbase/include -pipe -O3 -O3 -ggdb3 -MT src/ep_la-ep_engine.lo -MD -MP -MF src/.deps/ep_la-ep_engine.Tpo -c src/ep_engine.cc -fPIC -DPIC -o src/.libs/ep_la-ep_engine.o


If someone from build / infrastructure could confirm that would be great, but all the evidence suggests we are building our release packages with no optimisation (!!)

I believe the solution here is to change the invocation of cmake to set CMAKE_BUILD_TYPE=Release (sketched after the links below).


[1]: http://builds.hq.northscale.net:8010/builders/ubuntu-1204-x64-300-builder/builds/1100/steps/couchbase-server%20make%20enterprise%20/logs/stdio
[2]: http://builds.hq.northscale.net:8010/builders/ubuntu-1204-x64-251-builder/builds/38/steps/couchbase-server%20make%20enterprise%20/logs/stdio
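A sketch of what the corrected package-build invocation would look like, based on the Debug invocation quoted above; the only intended change is the build type, everything else is copied from the existing log, and the grep on CMakeCache.txt is just a sanity check of the flags that end up being used:

    (cd build && cmake -G "Unix Makefiles" -D CMAKE_INSTALL_PREFIX="/opt/couchbase" \
        -D CMAKE_PREFIX_PATH=";/opt/couchbase" -D PRODUCT_VERSION=3.0.0-1059-rel \
        -D BUILD_ENTERPRISE=TRUE -D CMAKE_BUILD_TYPE=Release -D CB_DOWNLOAD_DEPS=1 ..)
    grep -E 'CMAKE_BUILD_TYPE|CMAKE_CXX_FLAGS_RELEASE' build/CMakeCache.txt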
Comment by Dave Rigby [ 30/Jul/14 ]
Just checked RHEL - I see the same.

3.0:

    (cd build && cmake -G "Unix Makefiles" <cut> -D CMAKE_BUILD_TYPE=Debug <cut>

    Full logs: http://builds.hq.northscale.net:8010/builders/centos-6-x64-300-builder/builds/1095/steps/couchbase-server%20make%20enterprise%20/logs/stdio


2.5.1:

    libtool: compile: g++ <cut> -O3 -c src/ep_engine.cc -o src/.libs/ep_la-ep_engine.o
    
    Full logs: http://builds.hq.northscale.net:8010/builders/centos-6-x64-251-builder/builds/42/steps/couchbase-server%20make%20enterprise%20/logs/stdio


Comment by Dave Rigby [ 30/Jul/14 ]
I've separated the "packages built as debug" problem out into its own defect (MB-11854)
Comment by Sundar Sridharan [ 30/Jul/14 ]
Dave, I am unable to verify this at this time. Could you please let me know if you still see this issue on builds with optimizations enabled? Thanks.
Comment by Chiyoung Seo [ 30/Jul/14 ]
Thanks Dave for identifying the build issue.

Enabling "-O3" optimization will make a huge difference in the performance. We should set CMAKE_BUILD_TYPE=Release in 3.0 builds for a fair comparison.
Comment by Chris Hillery [ 30/Jul/14 ]
For those not watching CBD-1422 (nee MB-11854), I have pushed a fix for master and am verifying. I will update this bug when there is a 3.0 Release-mode build, hopefully later tonight.
Comment by Chris Hillery [ 31/Jul/14 ]
3.0.0 build 1068 is being built in Release mode. Should be uploaded in the next half-hour or so.
Comment by Wayne Siu [ 31/Jul/14 ]
Dave,
Can you run your test with the latest build and see if you still see the issue?
Comment by Chris Hillery [ 31/Jul/14 ]
Thomas Anderson has been testing with an updated build and mentioned significant improvement.




[MB-11862] couchbase cli in cluster-wide collectinfo stop shows SUCCESS even though no process is running at all Created: 31/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: No

 Description   
Install couchbase server 3.0.0-1074 on an ubuntu 12.04 node.
Run the couchbase-cli cluster-wide collectinfo command collect-logs-stop.
It shows SUCCESS even though no collection process is running at all.


root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 12.11.10.130:8091 -u Administrator -p password
SUCCESS: collect logs successfully stopped


 Comments   
Comment by Steve Yen [ 31/Jul/14 ]
If I understand this bug report right...

- the collectinfo job isn't running.

- then you run the command to stop the collectinfo job (which actually isn't running).

- and, the command finishes with a message of "SUCCESS: collect logs successfully stopped" (so, it's not running).

- and, the collectinfo job still (correctly) is not running afterwards.

I think English is ambiguous enough to allow this to be a Won't Fix, or at least it doesn't need to be addressed in 3.0.0 even if somebody decides that the output message should be changed to "SUCCESS: collect logs is no longer running" or equivalent.
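A related sanity check (hedged: it assumes the 3.0 couchbase-cli also ships the collect-logs-status subcommand) is to query the collection status before and after the stop, using the same cluster and credentials as above:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-status -c 12.11.10.130:8091 -u Administrator -p password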




[MB-11203] SSL-enabled memcached will hang when given a large buffer containing many pipelined requests Created: 24/May/14  Updated: 31/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Mark Nunberg Assignee: Jim Walker
Resolution: Unresolved Votes: 0
Labels: memcached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Sample code which fills a large number of pipelined requests and flushes them over a single buffer.

#include <libcouchbase/couchbase.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h> /* assert() is used throughout this repro */
static int remaining = 0;

static void
get_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_get_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
    }
    remaining--;
}

static void
stats_callback(lcb_t instance, const void *cookie, lcb_error_t err,
    const lcb_server_stat_resp_t *resp)
{
    printf("Remaining: %d \r", remaining);
    fflush(stdout);
    if (err != LCB_SUCCESS && err != LCB_KEY_ENOENT) {
    }

    if (resp->v.v0.server_endpoint == NULL) {
        fflush(stdout);
        --remaining;
    }
}

#define ITERCOUNT 5000
static int use_stats = 1;

static void
do_stat(lcb_t instance)
{
    lcb_CMDSTATS cmd;
    memset(&cmd, 0, sizeof(cmd));
    lcb_error_t err = lcb_stats3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

static void
do_get(lcb_t instance)
{
    lcb_error_t err;
    lcb_CMDGET cmd;
    memset(&cmd, 0, sizeof cmd);
    LCB_KREQ_SIMPLE(&cmd.key, "foo", 3);
    err = lcb_get3(instance, NULL, &cmd);
    assert(err==LCB_SUCCESS);
}

int main(void)
{
    lcb_t instance;
    lcb_error_t err;
    struct lcb_create_st cropt = { 0 };
    cropt.version = 2;
    char *mode = getenv("LCB_SSL_MODE");
    if (mode && *mode == '3') {
        cropt.v.v2.mchosts = "localhost:11996";
    } else {
        cropt.v.v2.mchosts = "localhost:12000";
    }
    mode = getenv("USE_STATS");
    if (mode && *mode != '\0') {
        use_stats = 1;
    } else {
        use_stats = 0;
    }
    err = lcb_create(&instance, &cropt);
    assert(err == LCB_SUCCESS);


    err = lcb_connect(instance);
    assert(err == LCB_SUCCESS);
    lcb_wait(instance);
    assert(err == LCB_SUCCESS);
    lcb_set_get_callback(instance, get_callback);
    lcb_set_stat_callback(instance, stats_callback);
    lcb_cntl_setu32(instance, LCB_CNTL_OP_TIMEOUT, 20000000);
    int nloops = 0;

    while (1) {
        unsigned ii;
        lcb_sched_enter(instance);
        for (ii = 0; ii < ITERCOUNT; ++ii) {
            if (use_stats) {
                do_stat(instance);
            } else {
                do_get(instance);
            }
            remaining++;
        }
        printf("Done Scheduling.. L=%d\n", nloops++);
        lcb_sched_leave(instance);
        lcb_wait(instance);
        assert(!remaining);
    }
    return 0;
}
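Build/run note (assumptions about the setup, not from the original report): the sample needs a libcouchbase build that provides the v3 scheduling API (lcb_sched_enter / lcb_get3 / lcb_stats3) and is linked with -lcouchbase. Setting LCB_SSL_MODE=3 points it at the SSL port used above (localhost:11996), anything else at the plain port (localhost:12000); setting USE_STATS pipelines STATS requests instead of GETs. The hang is expected once a whole batch of ITERCOUNT scheduled requests is flushed over a single buffer to the SSL-enabled memcached.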


 Comments   
Comment by Mark Nunberg [ 24/May/14 ]
http://review.couchbase.org/#/c/37537/
Comment by Mark Nunberg [ 07/Jul/14 ]
Trond, I'm assigning it to you because you might be able to delegate this to someone else. I can't see anything obvious in the diff since the original fix that would break it - of course my fix might not have fixed it completely but just made it work accidentally; or it may be flush-related.
Comment by Mark Nunberg [ 07/Jul/14 ]
Oh, and I found this on an older build of master; 837, and the latest checkout (currently 055b077f4d4135e39369d4c85a4f1b47ab644e22) -- I don't think anyone broke memcached - but rather the original fix was incomplete :(
Comment by Wayne Siu [ 31/Jul/14 ]
Jim,
Can you give us an update on this ticket? If possible, an ETA? Thanks.




[MB-11863] Checkpoint information is not persisted in _local/vbstate document leading to loss of all checkpoint information after restart. Created: 31/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Venu Uppalapati
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 64 bit build 1075

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
All checkpoint information is lost after restart:
1) Insert 100k items into the default bucket.
2) Use the couchstore API couchstore_open_local_document to get the vbstate JSON from a vbucket file and read the checkpoint id. It is zero.
3) Query the checkpoint stats using cbstats (a sketch follows these steps); the value is different (e.g. 6).
4) Restart the server gracefully using the command-line restart command.
5) Use the couchstore API to read the checkpoint information again; it is still zero.
6) Use cbstats to query the checkpoint information; this time it also shows zero.
7) All checkpoint information is lost after the restart.
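
For steps 3 and 6, a minimal sketch of capturing the cbstats side so the checkpoint ids can be diffed around the restart (illustration only, not part of the original report; the cbstats path and flags are assumptions based on a default Linux install):

import subprocess

def checkpoint_ids(host="localhost", bucket="default"):
    # "cbstats ... checkpoint" dumps per-vbucket checkpoint stats; keep only
    # the *checkpoint_id entries so the values before and after the restart
    # can be compared against what couchstore reports from _local/vbstate.
    out = subprocess.check_output(
        ["/opt/couchbase/bin/cbstats", "%s:11210" % host, "checkpoint", "-b", bucket],
        universal_newlines=True)
    stats = {}
    for line in out.splitlines():
        if ":" not in line:
            continue
        key, value = line.rsplit(":", 1)
        if "checkpoint_id" in key:
            stats[key.strip()] = value.strip()
    return stats

# e.g. before = checkpoint_ids(); restart the node; after = checkpoint_ids()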

 Comments   
Comment by Sundar Sridharan [ 31/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/#/c/40155/ thanks Venu!
Comment by Chiyoung Seo [ 31/Jul/14 ]
Merged.




[MB-11839] Significant drop in performance for SSL + XDCR during failover compared to the non-SSL + XDCR + failover case Created: 29/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sangharsh Agarwal Assignee: Sangharsh Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-0132-rel

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Copying comments from MB-11440.

[Points to Highlight]

The problem appears only with SSL XDCR. The same test always passes with non-SSL XDCR.

[Test Conditions]
The test is always reproducible if the failover side has 3 nodes and the other side has 4 nodes, i.e. after failover + rebalance the failover side has 2 nodes. For example, with a 4-node cluster on the failover side the test passed.
Analysis of the test found that updates are replicated to the other side very slowly, which causes this failure.

[Test Steps]
1. Have a 3-node Source cluster (S) and a 4-node Destination cluster (D).
2. Create two buckets, sasl_bucket_1 and sasl_bucket_2.
3. Set up SSL bi-directional XDCR (CAPI) for both buckets.
4. Load 10000 items into each bucket on the Source, keys with prefix "loadOne".
5. Load 10000 items into each bucket on the Source, keys with prefix "loadTwo".
6. Wait for 3 minutes to ensure items are replicated to both ends.
7. Failover + rebalance one node at the Source cluster.
8. Update (3000) and delete (3000) items on the Source, keys with prefix "loadOne".
9. Update (3000) items on the Destination, keys with prefix "loadTwo".
10. The test fails with a data mismatch between the data on Source (S) and Destination (D): the "loadTwo" keys from the Destination (the non-failover side) were not replicated when validation took place.


[Additional information]
1. Tests with a smaller number of items/updates pass successfully.
2. The test with a single bucket passes with the above-mentioned items/mutations.

[Workaround]
Increase the timeout from 3 minutes to 5 minutes when waiting for outbound mutations to reach zero, which ensures that all data is replicated from both sides of the bi-directional replication. However, XDCR over UPR should be faster than the previous XDCR, not slower.
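
For reference, a minimal sketch of how the "outbound mutations reach zero" wait can be polled directly (not part of the original test; it assumes Python with the requests package on the test client, and the replication_changes_left sample name exposed by the per-bucket stats REST endpoint):

import time
import requests  # assumed to be available on the test client

def outbound_mutations(host, bucket, user="Administrator", password="password"):
    # The per-bucket stats feed exposes replication_changes_left, the
    # outbound XDCR backlog sample (name assumed from the 2.x/3.0 stats feed).
    url = "http://%s:8091/pools/default/buckets/%s/stats" % (host, bucket)
    samples = requests.get(url, auth=(user, password)).json()["op"]["samples"]
    return samples.get("replication_changes_left", [0])[-1]

def wait_for_outbound_zero(host, bucket, timeout=300, interval=5):
    # Poll until the outbound XDCR queue drains or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if outbound_mutations(host, bucket) == 0:
            return True
        time.sleep(interval)
    return False

# e.g. wait_for_outbound_zero("10.1.3.93", "sasl_bucket_1", timeout=300)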





 Comments   
Comment by Sangharsh Agarwal [ 29/Jul/14 ]
Uploading the logs for the same test in the SSL and non-SSL scenarios for better comparison; the tests were performed on build 3.0.0-1032-rel.

Refer to the test steps as well; this bug appears only for bi-directional XDCR where the Source cluster has 3 nodes and the Destination cluster has 4 nodes.
Comment by Sangharsh Agarwal [ 29/Jul/14 ]
Test logs for SSL + XDCR + Failover:

[Test Logs]
test_ssl_failover.log : https://s3.amazonaws.com/bugdb/jira/MB-11839/c0026ab8/test_ssl_failover.log

[Source]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/58d5fc3b/10.1.3.96-7292014-221-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9e8fc3ef/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/a9a09431/10.1.3.96-7292014-28-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/1af6c6e8/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/41e904d3/10.1.2.12-7292014-224-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/b1f94b78/10.1.2.12-7292014-28-couch.tar.gz

10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11839/4330457a/10.1.3.97-diag.txt.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11839/8965931e/10.1.3.97-7292014-220-diag.zip

[Destination]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9f4b8681/10.1.3.93-7292014-28-couch.tar.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/b21a247b/10.1.3.93-diag.txt.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/d71032f9/10.1.3.93-7292014-215-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/94a7b502/10.1.3.94-7292014-218-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/d416acfe/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/f84a83bf/10.1.3.94-7292014-29-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/55a62f10/10.1.3.95-7292014-28-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9b41d727/10.1.3.95-7292014-217-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/f15a92d3/10.1.3.95-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/33f48efd/10.1.3.99-7292014-29-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/482438fd/10.1.3.99-7292014-222-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/8cf9c9d7/10.1.3.99-diag.txt.gz



[Observation]

Failover :-

2014-07-29 01:56:47 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 10.1.3.97:8091
2014-07-29 01:57:40 | INFO | MainProcess | Cluster_Thread | [task.check] rebalancing was completed with progress: 100% in 50.1205019951 sec

[3000 Updation]
2014-07-29 01:59:47 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.93:11210 sasl_bucket_1
2014-07-29 01:59:47 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.94:11210 sasl_bucket_1
2014-07-29 01:59:47 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.95:11210 sasl_bucket_1
2014-07-29 01:59:48 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.99:11210 sasl_bucket_1
2014-07-29 01:59:48 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.93:11210 sasl_bucket_2
2014-07-29 01:59:49 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.94:11210 sasl_bucket_2
2014-07-29 01:59:49 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.95:11210 sasl_bucket_2
2014-07-29 01:59:50 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.99:11210 sasl_bucket_2


[XDCR Outbound mutations goes to 0]
2014-07-29 02:06:23 | INFO | MainProcess | test_thread | [xdcrbasetests.__wait_for_outbound_mutations_zero] Current outbound mutations on cluster node: 10.1.3.93 for bucket sasl_bucket_1 is 0
2014-07-29 02:06:24 | INFO | MainProcess | test_thread | [xdcrbasetests.__wait_for_outbound_mutations_zero] Current outbound mutations on cluster node: 10.1.3.93 for bucket sasl_bucket_2 is 0

Approximately 7 minutes to replicate 3000 items.
Comment by Sangharsh Agarwal [ 29/Jul/14 ]
Test Logs non-SSL XDCR:

[Test log]
test_non_ssl_failover.log : https://s3.amazonaws.com/bugdb/jira/MB-11839/4481d160/test_non_ssl_failover.log

[Source]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/1bf90976/10.1.3.96-7292014-244-couch.tar.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/bb98407b/10.1.3.96-7292014-255-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/e67f8aef/10.1.3.96-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/3893fa4d/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/74885637/10.1.2.12-7292014-258-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/c4696897/10.1.2.12-7292014-245-couch.tar.gz

10.1.3.97 (failover node) : https://s3.amazonaws.com/bugdb/jira/MB-11839/07a166dd/10.1.3.97-diag.txt.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11839/a7f46f17/10.1.3.97-7292014-254-diag.zip

[Destination]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/140e4b7b/10.1.3.93-7292014-250-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/24721b49/10.1.3.93-7292014-245-couch.tar.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/5244684f/10.1.3.93-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/41b3b5a1/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/7a81cb86/10.1.3.94-7292014-253-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9d8d5208/10.1.3.94-7292014-245-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/37adb645/10.1.3.95-7292014-245-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/50bf33f9/10.1.3.95-7292014-251-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/f4cd038e/10.1.3.95-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/6ae1866c/10.1.3.99-7292014-245-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9d770cf0/10.1.3.99-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/f3b3d75d/10.1.3.99-7292014-256-diag.zip

[Observation]

Failover:-
2014-07-29 02:38:36 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 10.1.3.97:8091
2014-07-29 02:39:29 | INFO | MainProcess | Cluster_Thread | [task.check] rebalancing was completed with progress: 100% in 50.1264278889 sec

[Updation]
2014-07-29 02:41:37 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.93:11210 sasl_bucket_1
2014-07-29 02:41:37 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.94:11210 sasl_bucket_1
2014-07-29 02:41:38 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.95:11210 sasl_bucket_1
2014-07-29 02:41:38 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.99:11210 sasl_bucket_1
2014-07-29 02:41:38 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.93:11210 sasl_bucket_2
2014-07-29 02:41:39 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.94:11210 sasl_bucket_2
2014-07-29 02:41:39 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.95:11210 sasl_bucket_2
2014-07-29 02:41:40 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.99:11210 sasl_bucket_2


[XDCR finished]
2014-07-29 02:42:43 | INFO | MainProcess | test_thread | [xdcrbasetests.__wait_for_outbound_mutations_zero] Current outbound mutations on cluster node: 10.1.3.93 for bucket sasl_bucket_1 is 0
2014-07-29 02:42:44 | INFO | MainProcess | test_thread | [xdcrbasetests.__wait_for_outbound_mutations_zero] Current outbound mutations on cluster node: 10.1.3.93 for bucket sasl_bucket_2 is 0

Approximately 1 minute to replicate the same number of mutations.
Comment by Sangharsh Agarwal [ 29/Jul/14 ]
Logs for both scenarios are exclusively from the respective test runs and were collected without cleanup, as requested by Dev.

The issue is consistently reproducible from build 3.0.0-814.
Comment by Aleksey Kondratenko [ 29/Jul/14 ]
I need evidence of whether the same issue exists with 2.5.1 in the same environment.
Comment by Sangharsh Agarwal [ 29/Jul/14 ]
Logs for 2.5.1 SSL+XDCR+Failover

[TestLogs]
https://s3.amazonaws.com/bugdb/jira/MB-11839/d89c5f3d/test_ssl__2.5.1_failover.log

[Source]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/8b7e01b5/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/a8957cd4/10.1.3.96-7292014-935-couch.tar.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11839/a94e4c53/10.1.3.96-7292014-944-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/008f9e38/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/395b2024/10.1.2.12-7292014-946-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11839/c17af3b8/10.1.2.12-7292014-935-couch.tar.gz

10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11839/52ccc70a/10.1.3.97-diag.txt.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9b1e595a/10.1.3.97-7292014-943-diag.zip

[Destination]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/39d09793/10.1.3.93-7292014-940-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/b7abeb29/10.1.3.93-7292014-935-couch.tar.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11839/fa68a142/10.1.3.93-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/00fd880a/10.1.3.94-7292014-942-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/6abf3bc7/10.1.3.94-7292014-935-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11839/e075675f/10.1.3.94-diag.txt.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/3c8da978/10.1.3.95-7292014-941-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/40c8d780/10.1.3.95-7292014-935-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11839/6d476312/10.1.3.95-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/2e48410c/10.1.3.99-7292014-936-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/9b13c8f6/10.1.3.99-7292014-945-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11839/a59bea66/10.1.3.99-diag.txt.gz


[Observations]

Failover:

2014-07-29 09:26:41 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 10.1.3.97:8091
2014-07-29 09:26:43 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@10.1.3.97 successful
2014-07-29 09:26:43 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
2014-07-29 09:26:43 | INFO | MainProcess | test_thread | [biXDCR.load_with_failover] Failing over Source Non-Master Node 10.1.3.97:8091
2014-07-29 09:26:44 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.97&user=Administrator&knownNodes=ns_1%4010.1.3.97%2Cns_1%4010.1.3.96%2Cns_1%4010.1.2.12
2014-07-29 09:29:44 | INFO | MainProcess | Cluster_Thread | [task.check] rebalancing was completed with progress: 100% in 180.375282049 sec

[Updation]
2014-07-29 09:31:19 | INFO | MainProcess | test_thread | [xdcrbasetests.sleep] sleep for 30 secs. ...
2014-07-29 09:31:49 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.93:11210 sasl_bucket_1
2014-07-29 09:31:50 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.94:11210 sasl_bucket_1
2014-07-29 09:31:50 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.95:11210 sasl_bucket_1
2014-07-29 09:31:51 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.99:11210 sasl_bucket_1
2014-07-29 09:31:51 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.93:11210 sasl_bucket_2
2014-07-29 09:31:52 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.94:11210 sasl_bucket_2
2014-07-29 09:31:53 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.95:11210 sasl_bucket_2
2014-07-29 09:31:53 | INFO | MainProcess | test_thread | [data_helper.direct_client] creating direct client 10.1.3.99:11210 sasl_bucket_2


[Completion]
2014-07-29 09:32:59 | INFO | MainProcess | test_thread | [xdcrbasetests.__wait_for_outbound_mutations_zero] Current outbound mutations on cluster node: 10.1.3.93 for bucket sasl_bucket_1 is 0
2014-07-29 09:33:00 | INFO | MainProcess | test_thread | [xdcrbasetests.__wait_for_outbound_mutations_zero] Current outbound mutations on cluster node: 10.1.3.93 for bucket sasl_bucket_2 is 0


Replication completed in 1 minute in 2.5.1-1083-rel.
Comment by Anil Kumar [ 29/Jul/14 ]
Sangharsh - Can you clarify what 'Significant drop in performance' means here?
Comment by Sangharsh Agarwal [ 31/Jul/14 ]
Anil - 'Significant drop in performance' means that replicating 3000 updates takes 1 minute with non-SSL XDCR vs. 7 minutes with SSL + XDCR in 3.0.0 after a topology change (failover).

3.0.0-1083-rel (Non-SSL + XDCR) - 1 minute to replicate 3000 updates after failover.
3.0.0-1083-rel (SSL + XDCR) - 7 minutes to replicate 3000 updates after failover.

For reference, stats for 2.5.1-1083-rel:

2.5.1-1083-rel (Non-SSL XDCR) - 1 minute to replicate 3000 updates after failover.
2.5.1-1083-rel (SSL XDCR) - 1 minute to replicate 3000 updates after failover.
Comment by Aleksey Kondratenko [ 31/Jul/14 ]
http://review.couchbase.org/40147




[MB-11861] {DCP} :: Rebalance-in exited with memcached crash Created: 31/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 1:10.6.2.144
2:10.6.2.145
3:10.6.2.146
4:10.6.2.147
5:10.6.2.148
6:10.6.2.149
7:10.6.2.150

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build 1072, CentOS 6.x

1. Create 1 node cluster
2. Add default bucket with 100 K items
3. Rebalance in 2 nodes with create ops running in parallel

When step 3 is run for the third time (with 5 nodes, 2 being rebalanced in), the rebalance exits and crash dumps are seen.

Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 134. Restarting. Messages: Thu Jul 31 15:22:24.799695 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.144->ns_1@10.6.2.145:default - (vb 122) Stream closing, 0 items sent from disk, 572 items sent from memory, 1937 was last seqno sent
Thu Jul 31 15:22:24.799711 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.144->ns_1@10.6.2.145:default - (vb 124) Stream closing, 0 items sent from disk, 526 items sent from memory, 1880 was last seqno sent
Thu Jul 31 15:22:24.815042 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.144->ns_1@10.6.2.149:default - (vb 18) Stream closing, 0 items sent from disk, 0 items sent from memory, 1784 was last seqno sent
Thu Jul 31 15:22:24.815093 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.144->ns_1@10.6.2.149:default - (vb 64) Stream closing, 1689 items sent from disk, 0 items sent from memory, 1743 was last seqno sent
asssertion failed [highSeqno <= vb->getHighSeqno()] at /buildbot/build_slave/centos-5-x64-300-builder/build/build/ep-engine/src/ep.cc:2724

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.19183.31>,
{wait_seqno_persisted_failed,"default",44,
1677,
[{'ns_1@10.6.2.146',
{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@10.6.2.146'},
{if_rebalance,<0.18471.31>,
{wait_seqno_persisted,44,1677}},
infinity]}}}}]}}}

<0.19096.31> exited with {unexpected_exit,
{'EXIT',<0.19183.31>,
{wait_seqno_persisted_failed,"default",44,1677,
[{'ns_1@10.6.2.146',
{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[{'janitor_agent-default','ns_1@10.6.2.146'},
{if_rebalance,<0.18471.31>,
{wait_seqno_persisted,44,1677}},
infinity]}}}}]}}}

Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 134. Restarting. Messages: Thu Jul 31 15:22:24.346280 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.146->ns_1@10.6.2.149:default - (vb 30) Sending disk snapshot with start seqno 0 and end seqno 1762
Thu Jul 31 15:22:24.352786 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.146->ns_1@10.6.2.149:default - (vb 30) Backfill complete, 605 items read from disk, last seqno read: 1762
Thu Jul 31 15:22:24.352815 PDT 3: (default) Backfill task (1 to 1598) finished for vb 30 disk seqno 1762 memory seqno 1762
Thu Jul 31 15:22:24.353154 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@10.6.2.146->ns_1@10.6.2.147:default - (vb 102) Stream closing, 1311 items sent from disk, 581 items sent from memory, 1931 was last seqno sent
asssertion failed [highSeqno <= vb->getHighSeqno()] at /buildbot/build_slave/centos-5-x64-300-builder/build/build/ep-engine/src/ep.cc:2724

test case

./testrunner -i centos_x64_rebalance_in.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets_dist=5 -t rebalance.rebalancein.RebalanceInTests.incremental_rebalance_in_with_ops,replicas=3,items=100000,doc_ops=create,max_verify=100000,GROUP=IN;P2

Attaching logs



 Comments   
Comment by Aruna Piravi [ 31/Jul/14 ]
Does this mean MB-11725 is not completely fixed?
Comment by Chiyoung Seo [ 31/Jul/14 ]
Parag,

We need the gdb backtrace from memcached.
Comment by Parag Agarwal [ 31/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11861/core.memcached.1597.tar.gz

https://s3.amazonaws.com/bugdb/jira/MB-11861/core.memcached.12605.tar.gz

https://s3.amazonaws.com/bugdb/jira/MB-11861/1072.log.tar.gz
Comment by Chiyoung Seo [ 31/Jul/14 ]
I think I figured out the root cause of this issue. I will push the fix soon.
Comment by Parag Agarwal [ 31/Jul/14 ]
Many test cases are failing. Raising the priority.

http://qa.sc.couchbase.com/view/3.0.0/job/ubuntu_x64--109_00--Rebalance-In-Out/49/
Comment by Chiyoung Seo [ 31/Jul/14 ]
http://review.couchbase.org/#/c/40151/

Merged.




[MB-9584] icu-config is shipped in our bin directory Created: 18/Nov/13  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.1.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Marty Schoch Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: MacOSX 64-bit

 Description   
Today I ran:

$ icu-config --ldconfig
### icu-config: Can't find /opt/couchbase/lib/libicuuc.dylib - ICU prefix is wrong.
### Try the --prefix= option
### or --detect-prefix
### (If you want to disable this check, use the --noverify option)
### icu-config: Exitting.

I was surprised that icu-config was in the couchbase bin dir, and therefore in my path.

I asked in IRC and no one seemed to think it was actually useful for anything, and it doesn't even appear to output valid paths anyway. Recommend we remove it to avoid any user confusion.

 Comments   
Comment by Aleksey Kondratenko [ 31/Jul/14 ]
It is actually useful on non-OSX Unix boxes. OSX is unique due to this prefix changing/rewriting. On GNU/Linux the ICU we ship is actually usable to compile and link against.
Comment by Chris Hillery [ 31/Jul/14 ]
So, I could either figure out how to make icu-config work on a MacOS build, or eliminate it on MacOS. The latter is certainly quicker so I'll probably do that.




[MB-11642] Intra-replication falling far behind under moderate-heavy workload Created: 03/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, DCP
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Perry Krug Assignee: Thomas Anderson
Resolution: Fixed Votes: 0
Labels: performance, releasenote
Remaining Estimate: 0h
Time Spent: 47h
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File ep_upr_replica_items_remaining.png     PNG File latency_observe.png     PNG File OPS_during_rebalance.png     PNG File Repl_items_remaining_after_rebalance.png     PNG File Repl_items_remaining_before_rebalance.png     PNG File Repl_items_remaining_during_rebalance.png     PNG File Repl_items_remaining_start_of_rebalance.png     PNG File Screen Shot 2014-07-15 at 11.47.19 AM.png    
Issue Links:
Relates to
relates to MB-11640 DCP Prioritization Open
relates to MB-11675 40-50% performance degradation on app... Open
Triage: Triaged
Is this a Regression?: Yes

 Description   
Running the "standard sales" demo that puts a 50/50 workload of about 80k ops/sec across 4 nodes of m1.xlarge, 1 bucket 1 replica.

The "intra-cluster replication" value grows into the many k's.

This is a value that our users look rather closely at to determine the "safety" of their replication status. A reasonable number on 2.x has always been below 1k but I think we need to reproduce and set appropriate baselines for ourselves with 3.0.

Assigning to Pavel as it falls into the performance area and we would likely be best served if this behavior was reproduced and tracked.

 Comments   
Comment by Pavel Paulau [ 03/Jul/14 ]
Well, I underestimated your definition of moderate-heavy.)

I'm seeing a similar issue when the load is about 20-30K sets/sec. I will create a regular test and will provide all the information required for debugging.
Comment by Pavel Paulau [ 09/Jul/14 ]
Just wanted to double check, you can drain 10K documents/sec with both 2.5 and 3.0 builds, is that right?

UPDATE: actually 20K/sec because of replica.
Comment by Pavel Paulau [ 10/Jul/14 ]
In addition to replication queue (see attached screenshot) I measured replicateTo=1.

On average it looks better in 3.0 but there are quite frequent lags as well. Seems to be a regression.

Logs for build 3.0.0-943:
http://ci.sc.couchbase.com/job/ares/308/artifact/

My workload:
-- 4 nodes
-- 1 bucket
-- 40M x 1KB docs (non-DGM)
-- 70K mixed ops/sec (50% reads, 50% updates)
Comment by Perry Krug [ 13/Jul/14 ]
Pavel, I'm still seeing quite a few items sitting in the "intra-replication queue" and some spikes up into the low thousands. I'm using build 957.

The spikes seem possibly related to indexing activity and when I turn XDCR on, it gets _much_ worse.

Let me know if you need any logs from me or anything else I can do to help reproduce and diagnose.
Comment by Pavel Paulau [ 13/Jul/14 ]
Well, initial issue description didn't mention anything about indexing or XDCR.

Do you see the problem during a KV-only workload? Also, logs are required at this point.
Comment by Perry Krug [ 15/Jul/14 ]
Hey Pavel, here is a set of logs from my test cluster running the same workload I described earlier with one design document (two views). This is just the stock beer-sample dataset with a random workload on top of it.

You'll also see a few minutes after this cluster started up, that I turned on XDCR. The "items remaining" in the intra-replication queue shot up to over 1M and have not gone down. It also appears that the UPR drain rate for both XDCR, replication and views has nearly stopped completely with very sporadic spikes (see the recently attached screenshot)

I'm raising this to a blocker since it seems quite significant that the addition of XDCR was able to completely stop the UPR drain rates and has so negatively impacted our HA replication within a cluster.

Logs are at:
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.196.75.236.zip
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.196.81.119.zip
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.198.10.83.zip
http://s3.amazonaws.com/customers.couchbase.com/mb_11642/collectinfo-2014-07-15T155200-ns_1%4010.198.52.42.zip

This is on build 957
Comment by Perry Krug [ 15/Jul/14 ]
Just tested on build 966 and seeing similar behavior:
With above workload (70k ops/sec 50/50 gets/sets):
-K/V-only: the intra-replication queue is around 200-400, with an occasional spike up to 10k. I believe this is still unacceptable and something our customers will complain about.
-K/V+views: intra-replication queue baseline around 1-2k, with frequent spikes up to 4k-5k.
-K/V+XDCR (no views): intra-replication queue immediately begins growing significantly when XDCR is added. The drain rate for intra-replication is sometimes half what the drain rate for XDCR is (mb-11640 seems clearly needed here). The intra-replication queue reaches about 300k (across the cluster) and then starts going down once the XDCR items have been drained. It then hovers just under 200k. Again, not so acceptable.
-K/V+XDCR+1DD/2Views: Again, the intra-replication queue grows and does not seem to recover. It reaches over 1M and then the bucket ran out of memory and everything seemed to shut down. I'll be filing a separate issue for that.
Comment by Pavel Paulau [ 19/Jul/14 ]
It's actually easily reproducible, even in KV cases.

It appears that infrequent sampling misses most spikes, but "manual" observation detects occasional bursts (up to 60-70K in my run).
Comment by Thomas Anderson [ 31/Jul/14 ]
Retest of the application with the latest 3.0.0-1069 build shows a minor regression compared with 2.5.1 for intra-replication.
Same 4-node server system; 2 views, KV documents, target 80K OPS at a 50/50 ratio; added a node; rebalanced; performance comparable.
In 2.5.1: steady-state ep_upr_replica_items_remaining 200-400, with periodic 10K spikes.
In 3.0.0-1069: steady state before and after add node/rebalance, ep_upr_replica_items_remaining 200-500, with periodic 60K-75K spikes occurring no more frequently than in 2.5.1.
In 2.5.1: ~80K OPS (with 2 views); in 3.0.0-1069: ~77K OPS; both drop to ~60K OPS during rebalance.
Comment by Thomas Anderson [ 31/Jul/14 ]
I believe 3.0.0-1069 has been shown to address most of the regression from 2.5.1 in the OPS and intra-replication queue depth originally reported.




[MB-11761] Modifying remote cluster settings from non-SSL to SSL immediately after upgrade from 2.5.1 to 3.0.0 failed but passed after waiting for 2 minutes. Created: 17/Jul/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Sangharsh Agarwal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Upgrade from 2.5.1-1083 - 3.0.0-973

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-11761/b04bb6e3/10.3.3.126-diag.txt.gz
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-11761/e725c091/10.3.3.126-7172014-1030-diag.zip
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-11761/69f350e4/10.3.5.11-7172014-1032-diag.zip
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-11761/e0d1910f/10.3.5.11-diag.txt.gz

[Destination]
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-11761/204af908/10.3.5.60-7172014-1035-diag.zip
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-11761/feb4ca2f/10.3.5.60-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-11761/0c778e33/10.3.5.61-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-11761/58111486/10.3.5.61-7172014-1033-diag.zip
Is this a Regression?: Unknown

 Description   
Set up non-SSL XDCR between Source and Destination with version 2.5.1-1083.
Upgrading the remote clusters to 3.0.0-973 and then changing the settings to SSL failed immediately.

[2014-07-17 10:28:29,416] - [rest_client:747] ERROR - http://10.3.5.61:8091/pools/default/remoteClusters/cluster0 error 400 reason: unknown {"_":"Error {{tls_alert,\"unknown ca\"},\n [{lhttpc_client,send_request,1,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,199}]},\n {lhttpc_client,execute,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,151}]},\n {lhttpc_client,request,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,83}]}]} happened during REST call get to http://10.3.3.126:18091/pools."}
[2014-07-17 10:28:29,416] - [rest_client:821] ERROR - /remoteCluster failed : status:False,content:{"_":"Error {{tls_alert,\"unknown ca\"},\n [{lhttpc_client,send_request,1,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,199}]},\n {lhttpc_client,execute,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,151}]},\n {lhttpc_client,request,9,\n [{file,\"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl\"},\n {line,83}]}]} happened during REST call get to http://10.3.3.126:18091/pools."}
ERROR
[2014-07-17 10:28:29,418] - [xdcrbasetests:158] WARNING - CLEANUP WAS SKIPPED

======================================================================
ERROR: offline_cluster_upgrade (xdcr.upgradeXDCR.UpgradeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/xdcr/upgradeXDCR.py", line 195, in offline_cluster_upgrade
    self._modify_clusters(None, self.dest_master, remote_cluster['name'], self.src_master, require_encryption=1)
  File "pytests/xdcr/xdcrbasetests.py", line 1123, in _modify_clusters
    demandEncryption=require_encryption, certificate=certificate)
  File "lib/membase/api/rest_client.py", line 835, in modify_remote_cluster
    self.__remote_clusters(api, 'modify', remoteIp, remotePort, username, password, name, demandEncryption, certificate)
  File "lib/membase/api/rest_client.py", line 822, in __remote_clusters
    raise Exception("remoteCluster API '{0} remote cluster' failed".format(op))
Exception: remoteCluster API 'modify remote cluster' failed

----------------------------------------------------------------------
Ran 1 test in 614.847s

[Jenkins]
http://qa.hq.northscale.net/job/centos_x64--104_01--XDCR_upgrade-P1/22/consoleFull

[Test]
./testrunner -i centos_x64--104_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.offline_cluster_upgrade,initial_version=2.5.1-1083-rel,replication_type=xmem,bucket_topology=default:1>2;bucket0:1><2,upgrade_nodes=dest;src,use_encryption_after_upgrade=src;dest

Workaround: I added a wait of 120 seconds after the upgrade and before changing the XDCR settings, and the test passed.

Question: Is this expected behavior after upgrading from 2.5.1-1083 -> 3.0.0, given that the same test passes with no additional wait when upgrading from 2.0 -> 3.0 or 2.5.0 -> 3.0?


The issue occurs only for the upgrade from 2.5.1-1083-rel -> 3.0.0-973-rel.
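
For illustration, a minimal sketch of switching an existing remote cluster reference to SSL with a bounded retry, mirroring the 90-120 second wait described above (an assumption, not part of the original test: it uses Python with the requests package and the /pools/default/remoteClusters/<name> management endpoint with its standard form fields):

import time
import requests

def enable_remote_cluster_ssl(local_host, name, remote_host, user, password,
                              certificate, retry_for=120, interval=10):
    # POST the existing reference back with demandEncryption=1 and the remote
    # cluster's certificate; retried because a freshly upgraded node may briefly
    # reject the certificate with {tls_alert,"unknown ca"} as seen above.
    url = "http://%s:8091/pools/default/remoteClusters/%s" % (local_host, name)
    payload = {
        "hostname": "%s:8091" % remote_host,
        "username": user,
        "password": password,
        "name": name,
        "demandEncryption": 1,
        "certificate": certificate,
    }
    deadline = time.time() + retry_for
    while True:
        resp = requests.post(url, data=payload, auth=(user, password))
        if resp.status_code == 200:
            return resp.json()
        if time.time() > deadline:
            resp.raise_for_status()
        time.sleep(interval)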
 

 Comments   
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
Have you tried waiting (much) less than 2 minutes ?
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
I have not tried it with less than 2 minutes. I will try 1 minute or less and update you with the result.
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
The test failed with a 1-minute wait but passed with a 90-second wait.

Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Alk - Let us know if you're planning to fix this by 3.0 RC, or whether we should move it out to 3.0.1.
Comment by Aleksey Kondratenko [ 31/Jul/14 ]
I've merged diagnostics to help figure this case out: http://review.couchbase.org/40148

Please retry with a build that includes that fix.




[MB-7432] XDCR Stats enhancements Created: 17/Dec/12  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication, UI
Affects Version/s: 2.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Improvement Priority: Major
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-9218 Incoming XDCR mutations don't match o... Closed

 Description   
After seeing XDCR in action, would like to propose a few enhancements:

-Put certain statistics in the XDCR screen as well as on the graph page:
    -Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
    -Items per second to see speed of each stream and in total
    -Bandwidth in use. As per a customer, the most important thing with XDCR is going to be the possibly cross-country internet bandwidth and will need to monitor that for each replication stream and in total

-On the graph page of outgoing, I would recommend removing "mutations checked", "mutations replicated", "data replication", "active vb reps", "waiting vb reps", "secs in replicating", "secs in checkpointing", "checkpoints issued" and "checkpoints failed". These stats really aren't useful from the perspective of someone trying to monitor or troubleshoot the current state of their cluster.
-On the graph page of outbound, there's a bit of confusion over the difference between "mutations to replicate", "mutations in queue" and "queue size". Unless they are showing significantly (and usefully) different metrics, recommend to remove all but one
-On the graph page of incoming, recommend to put "total ops/sec" on the far left to line up with the "ops/sec" in the summary section
-"XDCR dest ops per sec" is confusing because this cluster is the "destination" yet the stat implies the other way around. Recommend "Incoming XDCR ops per sec"
-"XDCR docs to replicate" is a little confusing because it doesn't match the same stat in the "outbound". Recommend to change "mutations to replicate" to "XDCR docs to replicate"
-Would also be good to see outbound ops/sec in the summary section alongside the number remaining to replicate

 Comments   
Comment by Junyi Xie (Inactive) [ 18/Dec/12 ]
Perry,

I will certainly add the stats you suggested, and reorder some stats to make it more readable.


For the current stats, they exist for a reason; most of them are there because of requests from the QE and performance teams, although apparently they are not that interesting to users. If they do not cause a big downside, I would like to keep them for now.
Comment by Perry Krug [ 19/Dec/12 ]
Thanks Junyi. I'd actually like to continue the discussion about removing those stats, because anything a customer sees will generate a question as to its purpose...meaningful or not. We want the UI to be simple and direct for our users so they can understand what the cluster/node is doing...I don't think these 11 stats help accomplish that for our customers. Additionally, I think the ns_server team would agree that the fewer stats we have overall, the better for performance and maintenance.

To be clear, I'm not advocating for these stats removed from the system completely, just from the UI.
Comment by Junyi Xie (Inactive) [ 10/Jan/13 ]
Dipti,

Perry suggested removing some XDCR stats from the UI and adding some new stats. This is a big change in the XDCR UI and it would be better that you are aware of it. Before going ahead and implementing this, I would like to have your comments on the following:

1) Are these new stats necessary?

2) Are these old XDCR stats which Perry suggested to remove, still valid to some customers?

3) Which version do you want this change to happen in, say 2.0.1 (too late?), 2.1, or 3.0, etc.?

Please add others whom you think should be aware of this.

Thanks.
Comment by Junyi Xie (Inactive) [ 10/Jan/13 ]
Please see my comments.
Comment by Junyi Xie (Inactive) [ 10/Jan/13 ]
Ketaki and Abhinav,

Please also put your feedback about proposal Perry suggested. Thanks.
Comment by Ketaki Gangal [ 10/Jan/13 ]
Adding some more here

- Rate of Replication [items sent / sec]
- Average Replication Rate
- Lag in Replication ( Helpful to understand/observe If receiving too many back-offs/Timeouts)
          - Average Replication lag
- Items replicated
- Items to replicate
- Percentage Conflicts in Data

Other Useful ones
-------------------------------------
-one checkpoint every minute .
-back off handled by ns-server
-how many times retry
-timeouts - failed to replicate
-average replication lag
- XDCR data size
Comment by Ketaki Gangal [ 15/Jan/13 ]
Based on our discussion today, can we have the following changes/edits on the current XDCR stats.

1. On the XDCR tab, in addition to existing information add per Replication Setup
a. Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
b.Replication rate-Items per second to see speed of each stream and in total
c.Bandwidth in use. As per a customer, the most important thing with XDCR is going to be the possibly cross-country internet bandwidth and will need to monitor that for each replication stream and in total

2. On the Main bucket section
a. Rename XDC Dest ops/sec to "Incoming XDCR ops/sec"
b. Rename "XDC docs to replicate" to "Outbound XDCR docs"
c. Add Percentage Complete
d. Add XDCR Replication Rate

3. On Outgoing XDCR section
a. Remove "mutations checked" and "mutations replicated", move it at a logging level.
b. Remove "active vb reps" and "waiting vb reps" , move it to a logging level.
c. Remove "mutations in queue"
d. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR docs"
d. Rename "queue size" as "XDCR queue size"
e. Edit "num checkpoints issued ", "num checkpoints failed" to last 10 checkpoint instead of the entire set of checkpoints.
 
@Perry - The stats "secs in replicating" and "secs in checkpointing" have been useful in triaging XDCR bugs in the past.
Currently most of the XDCR stats are aggregated at the ns_server/mnesia level. The individual (per-vbucket) values are maintained at the log level. Considering the criticality of this stat, we've decided to continue maintaining this information for XDCR checkpointing.

Comment by Ketaki Gangal [ 15/Jan/13 ]
Of these , these stats are most critical


1. On the XDCR tab, in addition to existing information add per Replication Setup
a. Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
b.Replication rate-Items per second to see speed of each stream and in total
c.Bandwidth in use. As per a customer, the most important thing with XDCR is going to be the possibly cross-country internet bandwidth and will need to monitor that for each replication stream and in total
Comment by Dipti Borkar [ 16/Jan/13 ]
Ketaki, sorry I couldn't attend the meeting today. I want some clarification on some of these before we implement. I'll sync up with you tomorrow.
Comment by Perry Krug [ 16/Jan/13 ]
Thank you Ketaki.

A few more comments:
-I don't know that "percentage complete" and "XDCR replication rate" is necessarily needed in the "main bucket section"...those are really specific to each stream below and may not make sense to aggregate together.
-Are we planning on keeping "mutation to replicate" and "XDCR docs to replicate" as separate stats?
-Along with above, what is the difference between (and do we need to keep all) "XDCR queue size", and "Outbound XDCR docs"?
-I still question the usefulness of the "secs in replicating" and "secs in checkpointing"...won't these values be constantly incrementing for the life of the replication stream? When looking at a customer's environment after running for days/weeks/months, what are these stats expected to show? Apologies if I'm not understanding them correctly...

Thanks
Comment by Ketaki Gangal [ 16/Jan/13 ]
@Dipti - Sure, lets sync up today on this.

@Perry -
c. Add Percentage Complete - yes, this is more pertinent at a replication stream level
d. Add XDCR Replication Rate - yes, this is more pertinent at a replication stream level

 Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR docs" , so they should be the same stats.
@Junyi - Correct me if this is a wrong assumption.

XDCR Queue size : Is the actual memory being used currently to store the current queue ( which is a much smaller subset of all items to be replicated) We figured this would be useful to know while sizing the bucket/memory with ref to xdcr.
Outbound XDCR docs : Is the total items that are to be replicated, not all of them are in-memory at all times.

For "secs in checkpointing" and "secs in replicating", I agree this is a ever-growing number and when we run into a much larger runtime , typical customer scenario, this would be a huge number. However, we ve detected issues w/ XDCR in our previous testing very easily by using these stats,for example if the secs in checkpointing is way-off , it clearly shows some badness in xdcr.

Another way to do this would be to add logging/information elsewhere, but the current stats at the ns_server/XDCR level show these values on a per-vbucket basis, which may or may not be very useful while triaging errors of this kind.
We can, however, have a call to discuss this further if there is a better way to implement it.

Comment by Perry Krug [ 16/Jan/13 ]
Thanks for continuing the conversation Ketaki. A few more follow ons from my side:

XDCR Queue size : Is the actual memory being used currently to store the current queue ( which is a much smaller subset of all items to be replicated) We figured this would be useful to know while sizing the bucket/memory with ref to xdcr.
[pk] - Can you explain a bit more about memory being taken up for xdcr? Is this source or destination? What exactly is the RAM being used for? Is it in memcached or beam.smp?

For "secs in checkpointing" and "secs in replicating", I agree this is a ever-growing number and when we run into a much larger runtime , typical customer scenario, this would be a huge number. However, we ve detected issues w/ XDCR in our previous testing very easily by using these stats,for example if the secs in checkpointing is way-off , it clearly shows some badness in xdcr.
[pk] - When you say "way-off"...what do you mean? Between nodes within a cluster? Between clusters? What is the difference between the checkpointing meausurement and the replicating measurement? What do you mean by "badness" specifically?

Thanks Ketaki. This is all good information for our documentation and internal information as well.

Perry
Comment by Ketaki Gangal [ 20/Jan/13 ]
Hi Perry,

[pk] - Can you explain a bit more about memory being taken up for xdcr? Is this source or destination? What exactly is the RAM being used for? Is it in memcached or beam.smp?
Xdcr queue size - is the total memory used for the xdcr queue per node. We want to account for memory overhead w/ xdcr(we only store key and metadata.)
This is the memory on the source node. It is accounted in the beam.smp memory.

For each vb replicator:
the queue is created with following limits
maximum number of items in the queue: BatchSize * NumWorkers * 2, by default, the batch size is 500, and NumWorkers is 4, so the queue can hold at most 4000 mutations
maximum size of queue: 100 * 1024 * NumWorkers, by default, it is 400KB
In short, the queue is bounded by 400KB or hold 4000 items, whichever is reached first.

On each node there is max 32 active replicators, so it is 32*400KB = 12800KB = 12.8MB maximum memory overhead used by the queue.
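
A quick worked version of the arithmetic above, as a sketch (the constants are simply the defaults quoted in this comment):

# Defaults quoted above: batch size 500, 4 workers per vb replicator,
# and at most 32 active vb replicators per node.
BATCH_SIZE = 500
NUM_WORKERS = 4
ACTIVE_VB_REPLICATORS = 32

max_items_per_queue = BATCH_SIZE * NUM_WORKERS * 2   # 4000 mutations
max_bytes_per_queue = 100 * 1024 * NUM_WORKERS       # 409600 bytes (~400 KB)
max_bytes_per_node = max_bytes_per_queue * ACTIVE_VB_REPLICATORS
print(max_bytes_per_node // 1024)                    # 12800 KB, i.e. the ~12.8 MB per node quoted above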

[pk] - When you say "way-off"...what do you mean? Between nodes within a cluster? Between clusters? What is the difference between the checkpointing meausurement and the replicating measurement? What do you mean by "badness" specifically?
For "secs in replicating" v/s "secs in checkpointing" I am not sure of the exact difference between the two.
@Junyi - Could you explain more here?
I should've referred to "Docs to replicate" in place of "secs checkpointing", which led to significant checkpoint changes in the past - my bad. This "http://www.couchbase.com/issues/browse/MB-6939" was the one I had in mind when referring to badness.

thanks,
Ketaki
Comment by Junyi Xie (Inactive) [ 21/Jan/13 ]
This bug will spawn a list of fixes. My tentative plan is to resolve this bug by several commits, based on all discussion above.

First of all, let me make clear that the "docs" (or "items") XDCR replicates are actually "mutations": say we send 10 docs via XDCR to the remote cluster; it is possible that all of them are 10 mutations of a single document (item), rather than 10 different docs (items). So, in the stats section, we should use "mutations" instead of "docs" where applicable.

Here is my summary; please let me know if you have any questions or if I missed anything.

Commit 1: Rename current stats, just renaming, no change to the underlying stats
In the MAIN bucket section:
a. Rename XDC Dest ops/sec to "Incoming XDCR ops/sec"
b. Rename "XDC docs to replicate" to "Outbound XDCR mutations"
In the Outbound XDCR stats section:
c. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR mutations"
d. Rename "queue size" as "XDCR queue size"

Commit 2: Change current stats
In the Outbound XDCR stats section:
a. Change "num checkpoints issued ", "num checkpoints failed" to last 10 checkpoint instead of the entire set of checkpoints, also rename them correspondingly


Commit 3: Add new stats
In the Outbound XDCR stats section:
a. add new stat "Percentage of completeness", which is computed as the
"number of mutations already sent to remote side" / ("number of mutations already sent to remote side" + "number of mutations waiting to be sent to remote side").
Here "number of mutations waiting to be sent to remote side" is the stat "Outbound XDCR mutations"

b. Add a new stat "Replication rate", the number of mutations sent per second, to show the speed of each stream. Unit: mutations/second

c. Add a new stat "Bandwidth in use", defined as the number of bytes per second XDCR is using on the fly. Unit: bytes/second



Commit 4: remove all uninteresting stats and route them to logs
In Outbound XDCR stats section:
a. Remove "mutations checked" and "mutations replicated", move it at a logging level.
b. Remove "active vb reps" and "waiting vb reps" , move it to a logging level.
c. Remove "mutations in queue", move it to a logging level.



Comment by Perry Krug [ 21/Jan/13 ]
Thanks Junyi.

A couple quick questions/clarifications:
c. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR mutations"
[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" to be merged into one stat called "Outbound XDCR mutations" in the UI?

d. Rename "queue size" as "XDCR queue size"
[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage. Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?

Commit 3: Add new stats
[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.

Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?
Comment by Junyi Xie (Inactive) [ 21/Jan/13 ]
Perry,

[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" to be merged into one stat called "Outbound XDCR mutations" in the UI?
[jx] - Yes, this will unify these two stats. Actually they are the same stat with different names, one is at Main Section and the other is in Outbound XDCR stats. Per your comments, I will use the same name to remove the confusion.

[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage.
[jx] - This is defined in Bytes. If you move your mouse over the stat on UI, you will see the text "Size of bytes of XDC replication queue". If the data is a KB, MB, GB scale, you will see KB, MB, GB on the UI. There should not be confusion.

[pk]Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
[jx] - This 12.8MB is the just user-data (docs, mutations) queued to be replicated, it is just the queue created by XDCR but not including any other overhead. XDCR lives in ns_server erlang process, per node it will create 32 replicator, each replicator will create several worker process, and other erlang processes at run-time, for which there will be some memory overhead, which could be big, but I do not have number at this time.

From where do you get the 2GB of "extra memory"? Is it per node or per cluster?

[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
[jx] - Oh, I thought these new stats are in Outbound XDCR section, which is graph and per replication base. Why do we need a separate stat at different places?



[pk] - Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?

First, both are elapsed times aggregated across all vbucket replicators.

"secs in checkpointing" means how much time the XDCR vbucket replicators have spent on checkpointing.
"secs in replicating" means how much time the XDCR vbucket replicators have spent replicating mutations.

By monitoring these two stats, we can get some idea of where XDCR spends its time and what it is busy working on.

I understand these two stats may create some confusion on the customer side. As Ketaki said, they are still useful for the QE and performance teams. If customers really dislike them, we can remove them. :) Personally I am OK either way.


Comment by Perry Krug [ 21/Jan/13 ]
Thanks so much, Junyi.

Perry,

[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" to be merged into one stat called "Outbound XDCR mutations" in the UI?
[jx] - Yes, this will unify these two stats. Actually they are the same stat with different names; one is in the Main section and the other is in the Outbound XDCR stats. Per your comments, I will use the same name to remove the confusion.
[pk] - Perfect, thank you.

[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage.
[jx] - This is defined in bytes. If you hover over the stat in the UI, you will see the text "Size of bytes of XDC replication queue". If the value is at KB, MB, or GB scale, you will see KB, MB, or GB on the UI, so there should be no confusion.
[pk] - Yes, that will be great, thanks.

[pk]Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
[jx] - This 12.8MB is just the user data (docs, mutations) queued to be replicated; it is only the queue created by XDCR and does not include any other overhead. XDCR lives in the ns_server Erlang process; per node it will create 32 replicators, each replicator will create several worker processes, plus other Erlang processes at run time, all of which carry some memory overhead. That overhead could be significant, but I do not have a number at this time.

From where did you get the 2GB of "extra memory"? Is it per node or per cluster?
[pk] - This was the recommendation from QE based upon some analysis we did at Concur. It would be *extremely* helpful to get accurate and specific sizing information, and what takes up that size in whatever form.

[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
[jx] - Oh, I thought these new stats were in the Outbound XDCR section, which is graphed and per-replication. Why do we need a separate stat in a different place?
[pk] - This has to do with how and why these stats are being consumed. When a user is looking at their cluster to determine replication status, it is much easier to look at all the streams together...this is much harder to do when you have to click into each bucket and look at each individual stream. It's along the same lines as why we have item counts on the manage servers screen.


[pk] - Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?

First, both are elapsed times aggregated across all vbucket replicators.

"secs in checkpointing" means how much time the XDCR vbucket replicators have spent on checkpointing.
"secs in replicating" means how much time the XDCR vbucket replicators have spent replicating mutations.

By monitoring these two stats, we can get some idea of where XDCR spends its time and what it is busy working on.

I understand these two stats may create some confusion on the customer side. As Ketaki said, they are still useful for the QE and performance teams. If customers really dislike them, we can remove them. :) Personally I am OK either way.
[pk] - Thanks for the explanation. I would still advocate for removing them. The main reason being that they do not materially help identify any issue or behavior after the cluster has been running for an extended period of time. The up-to-the-second monitoring of these stats will show an extremely high number for both after just a few days or a week of a replication stream running...let alone multiple weeks or months. I can definitely see that they would be useful when debugging the initial stream or trying to identify an issue, but I would ask that they be moved to the log or other stat area outside of the UI.

Which leads me to another question :-) Do we already have documentation (or can you help with that) on where and how to get these "other" XDCR stats? Is there only a REST API to query? Are they printed into some log periodically? Could we get that detailed and written up?

Thanks again!
Comment by Junyi Xie (Inactive) [ 22/Jan/13 ]
Perry, you are highly welcome. Please see my response below.

[pk] - This has to do with how and why these stats are being consumed. When a user is looking at their cluster to determine the replication status, it will be much easier to look at all the streams together...this is much harder to do when you have to click into each bucket and look at each individual stream. It's in the same line as why we have item counts on the manage servers screen.
[jx] -- I see, thanks for the explanation. I agree that from the user perspective it is better to have a summary stat across ALL replications, not just per replication stream.
Today it seems we do not have anything like this (stats across all buckets), and there is no stat on the XDCR tab either, so I need to talk to the UI folks about how and where to add these stats. It involves some UI design change, more than just adding another per-replication stat to the UI. Better to


[pk] -- Which leads me to another question :-) Do we have documented already (or can you help with that) where and how to get these "other" stats regarding XDCR? Is there only a REST API to query? Are they printed into some log periodically? Could we get that detailed and written up?
[jx] -- Other than the UI stats, XDCR also dumps a lot of stats and information to the log files, but I am afraid they are too detailed and hard to parse from the customer's perspective :) Today I put all XDCR stats on the UI. Later, after we remove some stats from the UI (like secs in checkpointing), I will put them into the log and document how to get them easily. For all stats on the UI, you can use the standard REST API to retrieve them.
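
As a quick illustration of pulling those UI stats over the REST API, here is a minimal Python sketch. The host, credentials, bucket name and the op/samples response layout are assumptions for illustration only, not details taken from this ticket.

# Minimal sketch (Python 3): fetch per-bucket stats over the REST API.
# Host, credentials, bucket name and the op/samples layout are assumptions.
import base64
import json
import urllib.request

HOST = "127.0.0.1:8091"                       # cluster admin port (assumption)
BUCKET = "default"                            # bucket to inspect (assumption)
AUTH = base64.b64encode(b"Administrator:password").decode()  # placeholder creds

url = "http://%s/pools/default/buckets/%s/stats" % (HOST, BUCKET)
req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# Each stat comes back as a list of recent samples; show the latest value
# of anything XDCR-related.
for name, series in sorted(stats["op"]["samples"].items()):
    if "replication" in name or "xdc" in name.lower():
        print(name, "=", series[-1] if series else None)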

Comment by Junyi Xie (Inactive) [ 24/Jan/13 ]
Thanks everybody for discussion. I will start working on Commit 1 and 2, which we all agree on.
Comment by Junyi Xie (Inactive) [ 24/Jan/13 ]
Commit 1: http://review.couchbase.org/#/c/24189/

Comment by Junyi Xie (Inactive) [ 28/Jan/13 ]
Commit 2: http://review.couchbase.org/#/c/24251/
Comment by Junyi Xie (Inactive) [ 12/Feb/13 ]
Commit 3: add latency stats

http://review.couchbase.org/#/c/24399/
Comment by Perry Krug [ 14/Feb/13 ]
Thanks Junyi. Do we have a bug open already for the UI enhancements around this?
Comment by Junyi Xie (Inactive) [ 14/Feb/13 ]
I mean you can open another bug for the bandwidth usage, which is purely UI work and has nothing to do with XDCR code.

For this particular bug, MB-7432, all work on the XDCR side is done except the stats removal (Dipti will make the decision on that and will probably file another bug). So please close this bug if you do not need anything else from me.
Comment by Perry Krug [ 14/Feb/13 ]
So it sounds like this is not yet resolved if all the decisions haven't been made yet.

Assigning to Dipti to make the final decisions...I want to leave it open to make sure things get wrapped up.

Adding a UI component for the bandwidth request.
Comment by Maria McDuff (Inactive) [ 25/Mar/13 ]
deferred out of 2.0.2
Comment by Perry Krug [ 22/Oct/13 ]
Can we revisit this for 2.5 or 3.0? All that remains is a decision on removing some of the statistics. I still feel many of them are misleading or confusing to the user and should be moved somewhere more internal if they are still needed by our dev/QE teams.
Comment by Junyi Xie (Inactive) [ 22/Oct/13 ]
Anil,

I think it would be nice if you could call a meeting with Perry and me to discuss which XDCR stats should be removed. We do not want to remove stats that are still useful, or remove them only to have to re-add them later.
Comment by Perry Krug [ 24/Oct/13 ]
Just adding the comment from 9218:

Putting "outbound XDCR mutations" on one side and "incoming XDCR mutations" on the other side makes the two seem very related. Perhaps "outbound XDCR mutations" should be "XDCR backlog" to make it clearer that it is not a rate and should not match the number on the other side.
Comment by Perry Krug [ 29/Oct/13 ]
To summarize the conversation and provide next steps:

My primary goal here is to provide meaningful and "actionable" statistics to our customers. I recognize that there may be various other stats that are useful for testing and development, but not necessarily for the end customer. The determining factor in my mind is whether we can explain "what to do" when a particular number is high or low. If we do not have that, then I suggest the statistic does not need to be displayed in the UI. Much the same way we do not expose the 300 statistics available with cbstats, I think the same logic should be applied here.

So...my requests are:
-Change "outbound XDCR mutations" to "XDCR backlog" to indicate that this is the number of mutations within the source cluster that have not yet been replicated to the destination. This stat is shown both in the "summary" as well as the per-stream "outbound xdcr operations" sections
-Change "mutations replicated optimistically" from an incrementing counter to a "per second" rate
-Remove from "outbound xdcr operations" sections:
   -mutations checked*
   -mutations replicated*
   -data replicated*
   -active vb reps±
   -waiting vb reps±
   -secs in replicating*
   -secs in checkpointing*
   -checkpoints issued±
   -checkpoints failed±
   -mutations in queue~
   -XDCR queue size~

To provide some more explanation:
(*) - These stats are constantly incrementing and therefore, after weeks/months of time, are not useful for describing any behavior or problem
(±) - These stats are internal implementation details, and also do not signal to the user that they should take specific action
(~) - These stats are "bounded parameters". Therefore they should never be higher than what the parameter is set to. Even if they are higher or lower, we don't have a recommendation on "what to do" back to the customer


The stats I am suggesting to remove should still be available via the REST API, but I think they are not as useful in the UI. In the field, we sometimes need to explain not only what each stat means, but "what to do" based upon the value of these statistics. I don't feel that these statistics represent something the customer needs to be concerned about or act on.
Comment by Cihan Biyikoglu [ 11/Mar/14 ]
We will consider the feedback, but UPR work has priority and we are the long pole for the release. Moving to backlog; assigning to myself.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
[Minor edit on ordering...new layout below]

Discussed the stats layout with Anil and Perry yesterday. Below is Anil's capture of the result of that conversation. I may add some more stats to this, however (I'm thinking about %utilization, which might be quite useful and easily doable).

Hi Alk,

Here is what we discussed on XDCR stats -
First row
Outbound XDCR mutations
Percent completed
Active vb reps
Waiting vb reps

Second row
Mutation replication rate
Data replication rate
Mutation replicated optimistically rate
Mutations checked rate

Third row
Meta ops latency
Doc ops latency
New stats
New stats

Thanks!
Comment by Aleksey Kondratenko [ 31/Jul/14 ]
http://review.couchbase.org/#/c/40094/

Will need some more naming/placing advice but at least it works nicely now.




[MB-11143] Avg. BgFetcher wait time is still 3-4 times higher on a single HDD in 3.0 Created: 16/May/14  Updated: 31/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = HDD

Issue Links:
Dependency
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: 2.5.1:
http://ci.sc.couchbase.com/job/thor-64/637/artifact/

3.0.0-680:
http://ci.sc.couchbase.com/job/thor-64/643/artifact/
Is this a Regression?: Yes

 Description   
Read-heavy KV workload:
-- 4 nodes
-- 1 bucket x 200M x 1KB (20-30% resident)
-- 2K ops/sec
-- 1 / 80 / 18 / 1 C/R/U/D
-- 15-20% cache miss rate (most will miss page cache as well)

Average latency is still 5-6 times higher in 3.0. Histograms:

https://gist.github.com/pavel-paulau/e9b8ab4d75b9a662ff07

 Comments   
Comment by Pavel Paulau [ 16/May/14 ]
I know you made several fixes.

But at least this workload still looks bad.

In the meanwhile, I will test SSD (both cheap and expensive).
Comment by Chiyoung Seo [ 16/May/14 ]
Thanks Pavel for identifying this issue.

Sundar and I found recently that there are still some issues in scheduling the global threads.

Sundar,

Please take a look at this issue too. Thanks!
Comment by Sundar Sridharan [ 16/May/14 ]
Hi Pavel,
did you see a recent regression within 3.0 or is it from 2.5 only?
thanks
Comment by Pavel Paulau [ 16/May/14 ]
From 2.5 only.
Comment by Chiyoung Seo [ 16/May/14 ]
Sundar,

We saw this regression in 3.0 compared with 2.5
Comment by Sundar Sridharan [ 16/May/14 ]
Just to help narrow down the cause, is the increased latency seen in 3.0 with just 2 shards as opposed to 4? thanks
Comment by Chiyoung Seo [ 16/May/14 ]
The machine has 24 cores. I think Pavel used the default settings of our global thread pool.
Comment by Pavel Paulau [ 17/May/14 ]
On cheap SSD drives it's only ~50% slower.
Comment by Pavel Paulau [ 19/May/14 ]
On fast drives (RAID 10 SSD) it looks a little bit better.

It means the issue applies only to cheap, saturated devices. Not unexpected.
Comment by Chiyoung Seo [ 19/May/14 ]
Pavel,

Now I remember that we had some performance regression in bg fetch requests when the disk is a slow HDD (e.g., a single volume) or a commodity SSD. Can you test it with a more advanced HDD setup like RAID 10?

The PM team mentioned that most of our customers increasingly use advanced HDD setups, enterprise SSDs, or Amazon EC2 SSD instances.
Comment by Pavel Paulau [ 19/May/14 ]
It's 4x slower with 2 shards (Sundar requested this benchmark).

Let's discuss the problem once we get setups with RAID 10 HDD (CBIT-1158).
Comment by Pavel Paulau [ 21/Jun/14 ]
I was finally able to run the same tests on SSD and RAID 10 HDD. Everything looks good, there is no regression.

We also replaced the slow single HDD disks with faster ones. On average, 3.0.0 builds look 3-4 times slower. A comparison of the histograms is below:

2.5.1-1083 (logs - http://ci.sc.couchbase.com/job/leto/125/artifact/):

 bg_wait (338200 total)
    4us - 8us : ( 0.01%) 46
    8us - 16us : ( 0.35%) 1128
    16us - 32us : ( 7.14%) 22988 ##
    32us - 64us : ( 18.33%) 37816 ####
    64us - 128us : ( 31.60%) 44909 #####
    128us - 256us : ( 32.25%) 2193
    256us - 512us : ( 32.49%) 805
    512us - 1ms : ( 32.79%) 1016
    1ms - 2ms : ( 33.95%) 3916
    2ms - 4ms : ( 37.52%) 12089 #
    4ms - 8ms : ( 46.04%) 28795 ###
    8ms - 16ms : ( 58.17%) 41044 #####
    16ms - 32ms : ( 71.92%) 46474 #####
    32ms - 65ms : ( 84.73%) 43333 #####
    65ms - 131ms : ( 95.06%) 34931 ####
    131ms - 262ms : ( 99.52%) 15109 #
    262ms - 524ms : ( 99.99%) 1584
    524ms - 1s : (100.00%) 24

3.0.0-849 (logs - http://ci.sc.couchbase.com/job/leto/129/artifact/):

 bg_wait (339115 total)
    4us - 8us : ( 0.00%) 6
    8us - 16us : ( 0.03%) 101
    16us - 32us : ( 3.36%) 11291 #
    32us - 64us : ( 20.82%) 59206 #######
    64us - 128us : ( 39.39%) 62969 #######
    128us - 256us : ( 40.56%) 3984
    256us - 512us : ( 40.76%) 681
    512us - 1ms : ( 40.98%) 722
    1ms - 2ms : ( 41.89%) 3087
    2ms - 4ms : ( 43.73%) 6236
    4ms - 8ms : ( 47.58%) 13064 #
    8ms - 16ms : ( 53.79%) 21074 ##
    16ms - 32ms : ( 63.65%) 33410 ####
    32ms - 65ms : ( 75.12%) 38904 ####
    65ms - 131ms : ( 85.72%) 35968 ####
    131ms - 262ms : ( 93.55%) 26550 ###
    262ms - 524ms : ( 97.93%) 14844 #
    524ms - 1s : ( 99.38%) 4904
    1s - 2s : ( 99.81%) 1486
    2s - 4s : ( 99.96%) 509
    4s - 8s : (100.00%) 119
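
To make the regression concrete, here is a small Python sketch that estimates percentiles from the cumulative percentages in the two histograms above (bucket upper bounds approximated in milliseconds; values transcribed from the output, so the result is only an estimate):

# Estimate bg_wait percentiles from the cumulative histograms above.
# Each entry is (bucket upper bound in ms, cumulative percent).
H_251 = [(0.008, 0.01), (0.016, 0.35), (0.032, 7.14), (0.064, 18.33),
         (0.128, 31.60), (0.256, 32.25), (0.512, 32.49), (1, 32.79),
         (2, 33.95), (4, 37.52), (8, 46.04), (16, 58.17), (32, 71.92),
         (65, 84.73), (131, 95.06), (262, 99.52), (524, 99.99), (1000, 100.0)]
H_300 = [(0.008, 0.00), (0.016, 0.03), (0.032, 3.36), (0.064, 20.82),
         (0.128, 39.39), (0.256, 40.56), (0.512, 40.76), (1, 40.98),
         (2, 41.89), (4, 43.73), (8, 47.58), (16, 53.79), (32, 63.65),
         (65, 75.12), (131, 85.72), (262, 93.55), (524, 97.93),
         (1000, 99.38), (2000, 99.81), (4000, 99.96), (8000, 100.0)]

def percentile(hist, p):
    """Return the upper bound (ms) of the first bucket whose cumulative
    percentage reaches p."""
    for upper_ms, cum in hist:
        if cum >= p:
            return upper_ms
    return hist[-1][0]

for p in (50, 90, 95, 99):
    print("p%d: 2.5.1 <= %gms, 3.0.0 <= %gms"
          % (p, percentile(H_251, p), percentile(H_300, p)))
# The medians land in the same 8-16ms bucket, but the p95/p99 tail in the
# 3.0.0 run is roughly 2-4x longer, matching the reported regression.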
Comment by Pavel Paulau [ 21/Jun/14 ]
It's too late to improve these characteristics in 3.0, but I strongly recommend keeping this open and considering possible optimizations later.
Comment by Sundar Sridharan [ 23/Jun/14 ]
It may not be too late yet; bg fetch latencies are very important, especially with full eviction. The root cause could be either the increased number of writer threads or just a scheduling issue. If it is the former, the fix may be nontrivial as you suggest; if it is the latter, it should be easy to fix. I hope to have a fix for this soon.
Comment by Chiyoung Seo [ 24/Jun/14 ]
Moving this to post 3.0 because we observed lower latency on SSD and RAID HDD environments, but only saw the performance regressions on a single HDD.
Comment by Cihan Biyikoglu [ 25/Jun/14 ]
Agreed. A single HDD is a dev scenario and not performance critical.
Comment by Sundar Sridharan [ 31/Jul/14 ]
Pavel,
fix: http://review.couchbase.org/#/c/40080/ and
fix: http://review.couchbase.org/#/c/40084/
are expected to improve bgfetch latencies significantly.
Could you please investigate this with heavy-DGM scenarios?
Thanks




[MB-11042] cbstats usage output not clear Created: 05/May/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Sundar Sridharan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
This is the output of cbstats without any arguments on 2.5.1

====================
Usage: cbstats [options]

Options:
  -h, --help show this help message and exit
  -a iterate over all buckets (requires admin u/p)
  -b BUCKETNAME the bucket to get stats from (Default: default)
  -p PASSWORD the password for the bucket if one exists
Usage: cbstats host:port all
  or cbstats host:port allocator
......
====================

As a new user it isn't clear to me which arguments are required and which are optional. Optional arguments are usually denoted with brackets []. See: http://courses.cms.caltech.edu/cs11/material/general/usage.html

Also, the output does not make clear that the host:port argument expects the data port (usually 11210).
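
As an illustration of the bracket convention being requested (not the actual cbstats source), a minimal Python argparse sketch that marks optional arguments and spells out that host:port means the data port:

# Illustrative only -- not the real cbstats implementation. Shows the
# bracket convention for optional arguments and documents the data port.
import argparse

parser = argparse.ArgumentParser(
    prog="cbstats",
    usage="cbstats [options] <host>:<dataport> <command>",
    description="Fetch engine statistics. <dataport> is the memcached "
                "data port, usually 11210 (not the 8091 admin port).")
parser.add_argument("server", metavar="<host>:<dataport>",
                    help="node to query, e.g. 127.0.0.1:11210")
parser.add_argument("command", choices=["all", "allocator", "upragg"],
                    help="stat group to fetch (subset shown here)")
parser.add_argument("-a", action="store_true",
                    help="[optional] iterate over all buckets (requires admin u/p)")
parser.add_argument("-b", metavar="BUCKETNAME", default="default",
                    help="[optional] bucket to get stats from (default: default)")
parser.add_argument("-p", metavar="PASSWORD",
                    help="[optional] password for the bucket, if one exists")

if __name__ == "__main__":
    print(parser.parse_args())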

 Comments   
Comment by Anil Kumar [ 04/Jun/14 ]
cbstats belongs to ep_engine.

Triage - June 04 2014 Bin, Ashivinder, Venu, Tony, Anil
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Sundar - Let us know if you're planning to fix it by 3.0 RC or should we move this out to 3.0.1.
Comment by Sundar Sridharan [ 30/Jul/14 ]
This is not a big change and it is low impact, so I will try to fix it today if possible.
Comment by Sundar Sridharan [ 30/Jul/14 ]
Fix uploaded for review at http://review.couchbase.org/40069. Thanks.




[MB-11645] Make cbepctl parameter options consistent compared with the actual list of parameters that we support in 3.0 Created: 03/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Chiyoung Seo Assignee: Abhinav Dangeti
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
There are some mismatches between the cbepctl parameter options and the list of configurable engine parameters that we support in 3.0.

We need to make them consistent, so that the Couchbase documentation (http://docs.couchbase.com/couchbase-manual-2.5/cb-cli/index.html#cbepctl-tool) can be consistent too.


 Comments   
Comment by Mike Wiederhold [ 17/Jul/14 ]
The parameters we support and the cbepctl script are up to date. I think this should be assigned to the docs team to update the documentation page.
Comment by Mike Wiederhold [ 22/Jul/14 ]
I just double checked this while waiting for some tests to finish running. Everything looks up to date.




[MB-11713] DCP logging needs to be improved for view engine and xdcr Created: 11/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
After looking at one of the system tests I found the following number of log messages

[root@soursop-s11205 ~]# cat /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | wc -l
1061224 // Total log messages

[root@soursop-s11205 ~]# cat /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep xdcr | wc -l
1033792 // XDCR related upr log messages

[root@soursop-s11205 ~]# cat /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep -v xdcr | grep UPR | wc -l
3730 // Rebalance related UPR messages

In this case 97% of all log messages are for XDCR UPR streams.
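
For reference, a Python equivalent of the grep counts above; the log path and the xdcr/UPR markers are taken from the commands in this ticket, everything else is illustrative:

# Reproduce the counts above without shelling out to grep.
import glob

total = xdcr = rebalance_upr = 0
for path in glob.glob("/opt/couchbase/var/lib/couchbase/logs/memcached.log.*"):
    with open(path, errors="replace") as f:
        for line in f:
            total += 1
            if "xdcr" in line:
                xdcr += 1
            elif "UPR" in line:
                rebalance_upr += 1

print("total log messages:", total)
print("xdcr UPR messages:", xdcr)
print("other UPR messages:", rebalance_upr)
if total:
    print("xdcr share: %.1f%%" % (100.0 * xdcr / total))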

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Mike Wiederhold [ 29/Jul/14 ]
Marking as won't fix for now. The indexer and view engine should change the way they use UPR in a future release, and I cannot think of a better way to reduce the log messages based on their current usage.




[MB-11332] APPEND/PREPEND returning KEY_NOT_FOUND instead of NOT_STORED Created: 05/Jun/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Brett Lawson Assignee: Venu Uppalapati
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
When performing an APPEND or PREPEND operation on a key that does not exist, pre-3.0 servers return the NOT_STORED error code. However, as of 3.0, the server has begun responding with KEY_NOT_FOUND. This is a more logical error code, but it has implications for end users.

 Comments   
Comment by Matt Ingenthron [ 05/Jun/14 ]
marking as a blocker as it's an unintentional interface change and breaks existing apps
Comment by Anil Kumar [ 17/Jun/14 ]
Trond, can you please take a look since Chiyoung is out of office.
Comment by Trond Norbye [ 17/Jun/14 ]
This was explicitly done as a change from MB-10778

This is one of the situations where we have to decide what we're going to do moving forward. Should we use unique error codes to allow the app to know what happened, or should we stick with backwards compatibility "forever"? I would say that moving to a new major release like 3.0 would be a good time to add more specific error codes and properly document them. Previously you just knew that "something failed": it could be that the key didn't exist, or that we encountered an error storing the new key.

I can easily add back those lines, but that would also affect the bug report mentioned. The bigger question is: when should we start fixing our technical debt?

Let me know what you want me to do.
Comment by Trond Norbye [ 17/Jun/14 ]
Btw:

commit 869a66d1d08531af65169c59b640de4546974a34
Author: Sriram Ganesan <sriram@couchbase.com>
Date: Fri Apr 11 13:46:16 2014 -0700

    MB-10778: Return item not found instead of not stored

    When an application tried to append to an item that doesn't exist,
    ep-engine needs to return not found as opposed to not stored

    Change-Id: Ic4e50b069e41028cd879530a183d3ac43a3ebc1c
    Reviewed-on: http://review.couchbase.org/35619
    Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
    Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Comment by Anil Kumar [ 19/Jun/14 ]
Please check Trond's responses.
Comment by Matt Ingenthron [ 03/Jul/14 ]
I understand Trond's response, but I don't think I can speak to this. It's a reasonable thing to change responses between 2.x and 3.0, but is it PM's expectation that users who upgrade will need to make (albeit minor) changes to their applications?

I think that's the decision that needs to be made. Then the decision if this is a bug or not can be made.

I'm good with either, and there is zero client impact in either case. The impact is on end users: they may need to change the error handling in their applications before upgrading.

If we do want to make this kind of change, the time to do it is when doing a major version change.
Comment by Trond Norbye [ 03/Jul/14 ]
Part of the problem here is that we adapted the "error codes" from the old memcached environment when we implemented this originally. I do feel that we should try to provide "better" error codes in order to make life easier for the end user. The user would have to extend their code to try "add" when KEY_NOT_FOUND is returned, in addition to NOT_STORED, in order to handle both cluster versions. If we think adding the new error code is the wrong thing to do, I'm happy to revert the change.
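
A minimal Python sketch of the client-side fallback Trond describes, using hypothetical exception and client names rather than any specific SDK's API:

# Hypothetical exception/client names for illustration; the point is that
# an application must treat both error codes as "the key is not there yet"
# to work against both pre-3.0 (NOT_STORED) and 3.0 (KEY_NOT_FOUND) servers.
class NotStoredError(Exception):      # pre-3.0 behaviour
    pass

class KeyNotFoundError(Exception):    # 3.0 behaviour per MB-10778
    pass

def append_or_create(client, key, fragment, initial=""):
    """Append to a document, creating it if it does not exist yet."""
    try:
        client.append(key, fragment)
    except (NotStoredError, KeyNotFoundError):
        # Either error means the key is missing; fall back to add().
        client.add(key, initial + fragment)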
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
I assume there would be quite a few apps handling this error. I don't think we can take this change.

To be able to take a change like this we need a backward-compatibility flag. The flag would let admins modify the behavior of the server. For example, the flag could be set to be compliant with either 2.5.x or 3.x on a new version of the server: if it is set to 2.5.x we continue to return NOT_STORED, but if it is set to 3.x we can return KEY_NOT_FOUND. Until we have this, we cannot take breaking changes gracefully.
Comment by Trond Norbye [ 07/Jul/14 ]
It may "easily" be handled inside the client with such a property (just toggling the error code back to the generic not_stored return value). I'd rather not do it on the server (since it would introduce a new command you would have to send to the server in order to set the "compatibility" level for the connection)
Comment by Cihan Biyikoglu [ 07/Jul/14 ]
We may need both. A forward-looking facility like this would give us a general-purpose setting that can hide backward-compatibility issues for all components, such as N1QL, indexes, views, and more. There will be a whole lot more than error codes in the future.
Comment by Brett Lawson [ 07/Jul/14 ]
I'd like to add that the Node.js and libcouchbase clients currently map both errors to a single error code, such that testing for either KEY_NOT_FOUND or NOT_STORED works (though this could cause compiler errors in switch statements handling said errors).
Comment by Cihan Biyikoglu [ 16/Jul/14 ]
I don't think this should be on the blocker list. Please raise your voice if you disagree. Downgrading to critical.
Comment by Cihan Biyikoglu [ 17/Jul/14 ]
Brett, I take what you said as "there is no backward compat problem for anyone handling the exception in an existing app today if we make the change". Is that right?
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - July 17th

We decided to "revert" the change since it would break many customer apps. Also, this is a minor change.

Assigning this to Trond to make this change.
Comment by Trond Norbye [ 22/Jul/14 ]
http://review.couchbase.org/#/c/39665/

Are we really sure this is what we want to do and not add in the release notes that people should check the extra return code? (personally I would say that a major release would be the right place to do such a thing...)
Comment by Chiyoung Seo [ 22/Jul/14 ]
I understand the backward compatibility issues. However, we should fix these kinds of issues in major releases and note them in the release notes. Otherwise, we won't be able to fix them at all.

As the decision was made, the revert commit was merged.




[MB-11821] Rename UPR to DCP in stats and loggings Created: 25/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sundar Sridharan Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
ep-engine side changes are http://review.couchbase.org/#/c/39898/ thanks
Comment by Mike Wiederhold [ 30/Jul/14 ]
Assigning to myself since I'm coordinating the merge.




[MB-11237] DCP stats should report #errors under the web console Created: 28/May/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We should consider adding an errors/sec counter to the stats section of the web console. The counter would report the number of errors seen through the DCP protocol, to indicate issues with communication among nodes.

 Comments   
Comment by Anil Kumar [ 10/Jun/14 ]
Triage - June 10 2014 Anil
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical"; it needs to be fixed by RC.
Comment by Mike Wiederhold [ 30/Jul/14 ]
http://review.couchbase.org/#/c/40061/
Comment by Mike Wiederhold [ 30/Jul/14 ]
Alk,

We need to add this new stat to the web console. We can just call the stat "backoffs" for now.

Mike-Wiederholds-MacBook-Pro:ep-engine mikewied$ management/cbstats 127.0.0.1:12000 upragg
 :total:backoff: 21
 :total:count: 2
 :total:items_remaining: 46145
 :total:items_sent: 40561
 :total:producer_count: 1
 :total:total_backlog_size: 0
 :total:total_bytes: 44367044
 replication:backoff: 21 <-------------
 replication:count: 2
 replication:items_remaining: 46145
 replication:items_sent: 40561
 replication:producer_count: 1
 replication:total_backlog_size: 0
 replication:total_bytes: 44367044
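
A small Python sketch of scraping the backoff counters out of the cbstats output above; the cbstats path and host:port are placeholders, not values from this ticket:

# Sketch: pull the "backoff" counters out of `cbstats <host>:<port> upragg`.
import subprocess

CBSTATS = "/opt/couchbase/bin/cbstats"   # adjust to your install (assumption)
HOST = "127.0.0.1:12000"                 # memcached data port used above

out = subprocess.check_output([CBSTATS, HOST, "upragg"], text=True)
backoffs = {}
for line in out.splitlines():
    line = line.strip()
    if ":backoff:" in line:
        name, value = line.rsplit(":", 1)
        backoffs[name] = int(value)

print(backoffs)   # e.g. {':total:backoff': 21, 'replication:backoff': 21}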

Comment by Aleksey Kondratenko [ 30/Jul/14 ]
Will do this after the UPR -> DCP renaming is complete, i.e. to avoid the extra work of dealing with git conflicts.
Comment by Aleksey Kondratenko [ 31/Jul/14 ]
http://review.couchbase.org/40125




[MB-11801] It takes almost 2x more time to rebalance 10 empty buckets Created: 23/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-881

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File reb_empty.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/400/artifact/
Is this a Regression?: Yes

 Description   
Rebalance-in, 3 -> 4, 10 empty buckets

There was only one change:
http://review.couchbase.org/#/c/34501/

 Comments   
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

Raising this issue to "Critical"; it needs to be fixed by RC.
Comment by Mike Wiederhold [ 30/Jul/14 ]
This is caused by the sequence number persistence command taking up to 4 seconds in some cases. I'll post a fix shortly.
Comment by Mike Wiederhold [ 30/Jul/14 ]
http://review.couchbase.org/#/c/40063/
Comment by Pavel Paulau [ 31/Jul/14 ]
The fix helps. Thanks.




[MB-11748] Heap block overrun in _btree_init_node (crashes under GuardMalloc) Created: 16/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jens Alfke Assignee: Chiyoung Seo
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
Now that I've gotten past the previous GuardMalloc issue, my benchmark is crashing during fdb_end_transaction. The cause is that _btree_init_node is calling memcpy such that it writes past the end of a heap block. This is pretty nasty since it is likely to silently cause heap corruption in normal operation.

The call is
        memcpy((uint8_t *)node_addr + sizeof(struct bnode) + sizeof(metasize_t),
               meta->data, meta->size);
node_addr = 0x4b387df80
The first param to memcpy (the destination) is 0x4b387df92
meta->size (the number of bytes to copy) is 0x7b
So the byte range that will be written to is [0x4b387df92, 0x4b387e00d).
The crash occurs at address 0x4b387e000, so apparently the heap block ends somewhere before that.
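
A quick Python check of the arithmetic above; treating 0x4b387e000 as a page boundary guarded by GuardMalloc is an interpretation, not something stated in the report:

# Verify the byte range reported above and where it crosses a page boundary.
node_addr = 0x4b387df80
dest      = 0x4b387df92            # node_addr + sizeof(struct bnode) + sizeof(metasize_t)
size      = 0x7b                   # meta->size = 123 bytes
crash_at  = 0x4b387e000

end = dest + size
print(hex(dest), "->", hex(end))               # 0x4b387df92 -> 0x4b387e00d
print("header offset:", dest - node_addr)      # 18 bytes
print("bytes past the fault address:", end - crash_at)   # 13

# 0x4b387e000 is 4 KiB aligned; under GuardMalloc the allocation sits against
# an unmapped guard page, so any write reaching that address faults.
assert crash_at % 4096 == 0
assert dest < crash_at < end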

* thread #1: tid = 0x1ce7bc, 0x0000000102832c7a libsystem_platform.dylib`_platform_memmove$VARIANT$Unknown + 378, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x4b387e000)
    frame #0: 0x0000000102832c7a libsystem_platform.dylib`_platform_memmove$VARIANT$Unknown + 378
  * frame #1: 0x000000010016d43f HeadlessBee`_btree_init_node(btree=0x00000013605a2fa0, bid=17732923532773849, addr=0x00000004b387df80, flag='\x03', level=1, meta=0x00007fff5fbfa3e0) + 255 at btree.cc:202
    frame #2: 0x0000000100167e70 HeadlessBee`btree_init(btree=0x00000013605a2fa0, blk_handle=0x00000003554dafa0, blk_ops=0x000000010023dbd0, kv_ops=0x00000003554eff80, nodesize=4096, ksize='\b', vsize='\b', flag='\0', meta=0x00007fff5fbfa3e0) + 336 at btree.cc:375
    frame #3: 0x000000010014ce9e HeadlessBee`hbtrie_insert(trie=0x00000003554edfb0, rawkey=0x0000000e0424ef70, rawkeylen=144, value=0x00007fff5fbfa508, oldvalue_out=0x00007fff5fbfa510) + 5374 at hbtrie.cc:1347
    frame #4: 0x0000000100162155 HeadlessBee`_fdb_wal_flush_func(voidhandle=0x00000003554aaf00, item=0x0000000e04250f80) + 405 at forestdb.cc:1205
    frame #5: 0x00000001001506ad HeadlessBee`_wal_flush(file=0x00000003554acf30, dbhandle=0x00000003554aaf00, flush_func=0x0000000100161fc0, get_old_offset=0x0000000100162740, by_compactor=false)(void*, wal_item*), unsigned long long (*)(void*, wal_item*), bool) + 541 at wal.cc:575
    frame #6: 0x00000001001508e3 HeadlessBee`wal_flush(file=0x00000003554acf30, dbhandle=0x00000003554aaf00, flush_func=0x0000000100161fc0, get_old_offset=0x0000000100162740) + 51 at wal.cc:612
    frame #7: 0x000000010015d5bb HeadlessBee`fdb_commit(handle=0x00000003554aaf00, opt='\0') + 651 at forestdb.cc:2247
    frame #8: 0x0000000100145637 HeadlessBee`fdb_end_transaction(handle=0x00000003554aaf00, opt='\0') + 103 at transaction.cc:142
...

 Comments   
Comment by Jens Alfke [ 19/Jul/14 ]
This crash is still occurring even with the long-key support (commit 21dda1b).
The addresses are different, of course, but it's still trying to write 14 bytes into unmapped memory.
Comment by Jung-Sang Ahn [ 21/Jul/14 ]
Jens,
Would you please upload the new stack backtrace as a comment?
And is meta->size the same 0x7b as before?
Thanks.
Comment by Jung-Sang Ahn [ 21/Jul/14 ]
And also please share min_nodesize value in btree_init(). Thanks.
Comment by Jens Alfke [ 22/Jul/14 ]
I can't reproduce the crash anymore. And all I've changed is updating to the latest forestdb commit (23946ef), which only changed a test file, so that can't be related.

The only explanation I have is that my forestdb checkout had gotten stuck on an older commit without my noticing. It's a submodule of a submodule of my main repo, and I switch the main repo back and forth between the master and forestdb-based branch a lot. But I thought I was careful to make sure everything was on the right revision...

Anyway, I'll close this.




[MB-11747] Divide-by-zero crash in _get_nth_idx during fdb_commit (repeatable) Created: 16/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jens Alfke Assignee: Chiyoung Seo
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
My benchmark is repeatably crashing during a fdb_end_transaction call, with a divide-by-zero exception.
(I'm on the current head of the master branch, bc3885d.)

* thread #1: tid = 0x1bc81d, 0x000000010015bd54 HeadlessBee`_get_nth_idx(node=0x000000010f096c00, num=0, den=0, idx=0x00007fff5fbf8ae0) + 52 at btree_kv.cc:127, queue = 'com.apple.main-thread', stop reason = EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0)
  * frame #0: 0x000000010015bd54 HeadlessBee`_get_nth_idx(node=0x000000010f096c00, num=0, den=0, idx=0x00007fff5fbf8ae0) + 52 at btree_kv.cc:127
    frame #1: 0x000000010016bf5f HeadlessBee`_btree_split_node(btree=0x00000001129d1580, key=0x00007fff5fbfa1e8, node=0x00007fff5fbf8cd0, bid=0x00007fff5fbf8d40, idx=0x00007fff5fbf8d50, i=0, kv_ins_list=0x00007fff5fbf8cf0, nsplitnode=0, k=0x00007fff5fbf8d70, v=0x00007fff5fbf8d60, modified=0x00007fff5fbf8d30, minkey_replace=0x00007fff5fbf8d00, ins=0x00007fff5fbf8d10) + 671 at btree.cc:715
    frame #2: 0x000000010016da7d HeadlessBee`btree_insert(btree=0x00000001129d1580, key=0x00007fff5fbfa1e8, value=0x00007fff5fbfa400) + 3501 at btree.cc:1015
    frame #3: 0x000000010014efd4 HeadlessBee`hbtrie_insert(trie=0x00000001127b4650, rawkey=0x0000000113af6710, rawkeylen=299, value=0x00007fff5fbfa568, oldvalue_out=0x00007fff5fbfa570) + 6084 at hbtrie.cc:1386
    frame #4: 0x00000001001646c5 HeadlessBee`_fdb_wal_flush_func(voidhandle=0x00000001127960a0, item=0x0000000113af91e0) + 405 at forestdb.cc:1197
    frame #5: 0x00000001001526cd HeadlessBee`_wal_flush(file=0x00000001127cb010, dbhandle=0x00000001127960a0, flush_func=0x0000000100164530, get_old_offset=0x0000000100164cb0, by_compactor=false)(void*, wal_item*), unsigned long long (*)(void*, wal_item*), bool) + 541 at wal.cc:575
    frame #6: 0x0000000100152903 HeadlessBee`wal_flush(file=0x00000001127cb010, dbhandle=0x00000001127960a0, flush_func=0x0000000100164530, get_old_offset=0x0000000100164cb0) + 51 at wal.cc:612
    frame #7: 0x000000010015f99b HeadlessBee`fdb_commit(handle=0x00000001127960a0, opt='\0') + 651 at forestdb.cc:2239
    frame #8: 0x0000000100147467 HeadlessBee`fdb_end_transaction(handle=0x00000001127960a0, opt='\0') + 103 at transaction.cc:142


 Comments   
Comment by Jens Alfke [ 16/Jul/14 ]
I looked into the direct cause of the crash: why is den==0?
This parameter originally comes from the return value of _btree_get_nsplitnode(), so I added an assertion that the return value (nnode) is greater than zero.
When I hit the assertion failure, I saw that the value of dataspace is a bogus huge number, so dividing by it leads to a 0 result.
dataspace is bogus because it's calculated as
    dataspace = nodesize - headersize;
and nodesize=4095 but headersize=16208. The result is negative of course.

node->flag is 0x4b, meaning that the 'if' statement above went through the first branch.
That's about all I can figure out :-)
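
A small Python sketch of the arithmetic Jens describes, assuming dataspace is computed in an unsigned 32-bit type (so the negative result wraps to the huge value observed) and that the split count comes from an integer division by dataspace; the datasize value is made up for illustration:

# Illustration of how nodesize - headersize can wrap in unsigned arithmetic
# and make an integer division come out as zero. Types/width are assumptions.
MASK32 = 0xFFFFFFFF

nodesize   = 4095
headersize = 16208

dataspace = (nodesize - headersize) & MASK32   # wraps to a bogus huge number
print(hex(dataspace))                          # 0xffffd0af

datasize = 50000                               # placeholder payload size (assumption)
nsplitnode = datasize // dataspace             # integer division by a huge value
print(nsplitnode)                              # 0 -> later used as den, so the
                                               # divide in _get_nth_idx faults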
Comment by Jens Alfke [ 16/Jul/14 ]
Also possibly relevant: Shortly before this happens I'm seeing some warnings in my code — it's deleting a large number of documents by their sequence IDs, and two of the sequence IDs aren't found (fdb_get_metaonly_byseq can't find the document.) That shouldn't be happening since I know all the sequence IDs are valid, so it makes me suspect there are already problems in the by-sequence btree.
Comment by Jens Alfke [ 19/Jul/14 ]
Not seeing this anymore since the long-key support was checked in (commit 4b3969f92). However, I still can't get all the way through my benchmark code due to hitting MB-11748, so I don't know for sure that this is gone.
Comment by Jens Alfke [ 22/Jul/14 ]
Benchmark completes successfully now. Closing this.




[MB-11778] upr replica is unable to detect death of upr producer (was: Some replica items not deleted) Created: 21/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centOS 6.x

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://172.23.106.47:8091/index.html
http://172.23.106.45:8091/index.html

https://s3.amazonaws.com/bugdb/jira/MB-11573/logs.tar
Is this a Regression?: Unknown

 Description   
I'm seeing a bug similar to MB-11573 on build 991: 600 replica items haven't been deleted. However, curr_items and vb_active_curr_items are correct.


2014-07-21 18:18:44 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 2800 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.47:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.48:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 2800 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.47:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.48:11210 default
2014-07-21 18:18:45 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
2014-07-21 18:18:48 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', sasl_bucket_1 bucket
2014-07-21 18:18:49 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', standard_bucket_1 bucket

testcase:
./testrunner -i sanity.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,reboot=dest_node,items=2000,rdirection=bidirection,replication_type=xmem,standard_buckets=1,sasl_buckets=1,pause=source-destination,doc-ops=update-delete,doc-ops-dest=update-delete

What the test does:

3nodes * 3nodes, bi-dir xdcr on 3 buckets
1. Load 2k items on both clusters. Pause all xdcr(all items got replicated by this time)
2. Reboot one dest node (.48)
3. After warmup, resume replication on all buckets, on both clusters
4. 30% Update, 30% delete items on both sides. No expiration set.
5. Verify item count , value and rev-ids.


The cluster is available for debugging until tomorrow morning. Thanks.

 Comments   
Comment by Chiyoung Seo [ 21/Jul/14 ]
Mike,

Can you please look at this issue? The live cluster is available now.

Seems like the deletions are not replicated.
Comment by Mike Wiederhold [ 22/Jul/14 ]
The cluster looks fine right now so the problem seemed to work itself out. In the future please run one of the scripts we have to figure out which vbuckets are mismatched in the cluster. This will greatly reduce the amount of time needed to look through the cbcollectinfo logs.
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
It was different issue, removed my comment.
Comment by Mike Wiederhold [ 22/Jul/14 ]
Alk,

In the memcached logs it looks like at the time this bug was reported there were missing items. Then, about 2 hours later, I see ns_server create a bunch of replication streams, and all of the items that were "missing" are no longer actually missing. Can you take a look at this from the ns_server side and see why it took so long to create the replication streams?

Also, note that as of right now there is only a live cluster and no cbcollectinfo on the ticket.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Which cluster do I need to look at, Mike?
Comment by Mike Wiederhold [ 22/Jul/14 ]
http://172.23.106.47:8091/index.html (This is the one that had the problem. Node .47 in particular)
http://172.23.106.45:8091/index.html
Comment by Aruna Piravi [ 22/Jul/14 ]
Please note that the cbcollect has already been attached under the link-to-logs section - https://s3.amazonaws.com/bugdb/jira/MB-11573/logs.tar - along with the cluster IPs.
Comment by Aruna Piravi [ 22/Jul/14 ]
And cbcollectinfo was grabbed at the time replica items were incorrect.

Just curious, only replica items were incorrect. Active vb items on both clusters were correct. Does this still have to do with xdcr?
Comment by Aruna Piravi [ 22/Jul/14 ]
ok, I think Mike meant the intra-cluster replication streams.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
I'm not seeing the UPR replicators spot this shutdown at all. And I've just verified that if I kill -9 memcached, Erlang's replicator correctly detects the connection closure and reestablishes connections.

Will now test with VMs and reboot.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Confirmed manually by doing a "hard reset" of a VM and observing that the other VM does not re-establish UPR connections after the reset VM is rebooted.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
It is "it".
Comment by Mike Wiederhold [ 22/Jul/14 ]
http://review.couchbase.org/#/c/39683/
Comment by Aruna Piravi [ 24/Jul/14 ]
Verified on 1014. Closing this issue, thanks.




[MB-11720] Backfilling the entire vbucket can starve other streams that also need to backfill Created: 14/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Mike Wiederhold Assignee: Aruna Piravi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates MB-11602 KV+XDCR System test : Rebalance gets ... Closed
Relates to
relates to MB-11714 3x regression in XDCR replication lat... Closed
Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Mike Wiederhold [ 14/Jul/14 ]
Making this a blocker until we make a decision on how we want to handle this.
Comment by Sundar Sridharan [ 14/Jul/14 ]
Increasing the limit on the number of threads that are allowed to work on the AUXIO queue type from 10% to 30%.
This means that on systems with fewer than 20 threads, up to 3 threads can be doing backfilling in parallel.
http://review.couchbase.org/39364
thanks
Comment by Chiyoung Seo [ 15/Jul/14 ]
Mike,

I'm not sure if this ticket is related to MB-11714.

As Pavel posted, MB-11714 was a regression from one of the changes between 3.0.0-918 and 3.0.0-919:

http://builder.hq.couchbase.com/#/compare/couchbase-server-enterprise_centos6_x86_64_3.0.0-918-rel.rpm/couchbase-server-enterprise_centos6_x86_64_3.0.0-919-rel.rpm
Comment by Pavel Paulau [ 15/Jul/14 ]
Alk mentioned that MB-11597 "Might be related to MB-11714."

MB-11597 was closed as duplicate of MB-11720. That's it.
Comment by Sundar Sridharan [ 16/Jul/14 ]
A fix to make threads poll both priority queues before every sleep was merged as part of http://review.couchbase.org/#/c/39210/. This may help with the backfill starvation seen. Please reopen this ticket if the backfill issue in MB-11602 resurfaces. Thanks.
Comment by Chiyoung Seo [ 23/Jul/14 ]
In MB-11602, Aruna confirmed that she didn't see any more timeout issues in the tests.




[MB-11827] {UPR} :: Rebalance stuck with rebalance-out due to indexing stuck Created: 26/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.6.2.144-10.6.2.147

Triage: Untriaged
Is this a Regression?: Yes

 Description   
1033, centos 6x

1. Create 4 node cluster
2. Create default bucket
3. Add 1 K items
4. Create 3 views and start querying
5. Rebalance-out 1 node

Step 4 and Step 5 act in parallel

Rebalance is stuck

Looked at the couchdb log and found the following error across different machines in the cluster:

[couchdb:error,2014-07-26T18:47:27.450,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16158.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.557,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16199.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.745,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16230.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.783,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16240.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.831,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16250.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.877,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16260.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

TEST CASE ::
./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalance_progress.RebalanceProgressTests.test_progress_rebalance_out,nodes_init=4,nodes_out=1,GROUP=P0,skip_cleanup=True,blob_generator=false

 Comments   
Comment by Parag Agarwal [ 26/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11827/1033log.tar.gz

Comment by Parag Agarwal [ 26/Jul/14 ]
Test Case failing:: http://qa.hq.northscale.net/view/3.0.0/job/centos_x64--02_05--Rebalance_Progress/

Check the first 6; rebalance hangs.
Comment by Aleksey Kondratenko [ 26/Jul/14 ]
Indeed we're waiting for indexes.
Comment by Sarath Lakshman [ 28/Jul/14 ]
This looks like a duplicate of recently filed bug, MB-11786


[couchdb:info,2014-07-26T18:56:22.325,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
[couchdb:info,2014-07-26T18:56:22.427,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
[couchdb:info,2014-07-26T18:56:22.528,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
[couchdb:info,2014-07-26T18:56:22.629,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
Comment by Sarath Lakshman [ 28/Jul/14 ]
FYI,
[couchdb:error,2014-07-26T18:47:27.783,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16240.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

This message is harmless. I am planning to reduce its log level; I will do this as part of the final round of cleanup.
Comment by Mike Wiederhold [ 28/Jul/14 ]
http://review.couchbase.org/#/c/39960/
Comment by Parag Agarwal [ 29/Jul/14 ]
Re-ran the failing test; it passed.




[MB-11805] KV+ XDCR System test: Missing items in bi-xdcr only Created: 23/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-998

Clusters
-----------
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/
Free for investigation. Not attaching data files.

Steps
--------
1a. Load on both clusters till vb_active_resident_items_ratio < 50.
1b. Setup bi-xdcr on "standardbucket", uni-xdcr on "standardbucket1"
2. Access phase with 50% gets, 50%deletes for 3 hrs
3. Rebalance-out 1 node at cluster1
4. Rebalance-in 1 node at cluster1
5. Failover and remove node at cluster1
6. Failover and add-back node at cluster1
7. Rebalance-out 1 node at cluster2
8. Rebalance-in 1 node at cluster2
9. Failover and remove node at cluster2
10. Failover and add-back node at cluster2
11. Soft restart all nodes in cluster1 one by one
Verify item count

Problem
-------------
standardbucket(C1) <---> standardbucket(C2)
On C1 - 57890744 items
On C2 - 57957032 items
standardbucket1(C1) ----> standardbucket1(C2)
On C1 - 14053020 items
On C2 - 14053020 items

Total number of missing items : 66,288

Bucket priority
-----------------------
Both standardbucket and standardbucket1 have high priority.


Attached
-------------
cbcollect and list of keys that are missing on vb0


Missing keys
-------------------
At least 50-60 keys are missing in every vbucket. Attaching all missing keys from vb0.

vb0
-------
{'C1_node:': u'172.23.105.44',
'vb': 0,
'C2_node': u'172.23.105.54',
'C1_key_count': 78831,
 'C2_key_count': 78929,
 'missing_keys': 98}

     id: 06FA8A8B-11_110 deleted, tombstone exists
     id: 06FA8A8B-11_1354 present, report a bug!
     id: 06FA8A8B-11_1426 present, report a bug!
     id: 06FA8A8B-11_2175 present, report a bug!
     id: 06FA8A8B-11_2607 present, report a bug!
     id: 06FA8A8B-11_2797 present, report a bug!
     id: 06FA8A8B-11_3871 deleted, tombstone exists
     id: 06FA8A8B-11_4245 deleted, tombstone exists
     id: 06FA8A8B-11_4537 present, report a bug!
     id: 06FA8A8B-11_662 deleted, tombstone exists
     id: 06FA8A8B-11_6960 present, report a bug!
     id: 06FA8A8B-11_7064 present, report a bug!
     id: 3600C830-80_1298 present, report a bug!
     id: 3600C830-80_1308 present, report a bug!
     id: 3600C830-80_2129 present, report a bug!
     id: 3600C830-80_4219 deleted, tombstone exists
     id: 3600C830-80_4389 deleted, tombstone exists
     id: 3600C830-80_7038 present, report a bug!
     id: 3FEF1B93-91_2890 present, report a bug!
     id: 3FEF1B93-91_2900 present, report a bug!
     id: 3FEF1B93-91_3004 present, report a bug!
     id: 3FEF1B93-91_3194 present, report a bug!
     id: 3FEF1B93-91_3776 deleted, tombstone exists
     id: 3FEF1B93-91_753 present, report a bug!
     id: 52D6D916-120_1837 present, report a bug!
     id: 52D6D916-120_3282 present, report a bug!
     id: 52D6D916-120_3312 present, report a bug!
     id: 52D6D916-120_3460 present, report a bug!
     id: 52D6D916-120_376 deleted, tombstone exists
     id: 52D6D916-120_404 deleted, tombstone exists
     id: 52D6D916-120_4926 present, report a bug!
     id: 52D6D916-120_5022 present, report a bug!
     id: 52D6D916-120_5750 present, report a bug!
     id: 52D6D916-120_594 deleted, tombstone exists
     id: 52D6D916-120_6203 present, report a bug!
     id: 5C12B75A-142_2889 present, report a bug!
     id: 5C12B75A-142_2919 present, report a bug!
     id: 5C12B75A-142_569 deleted, tombstone exists
     id: 73C89FDB-102_1013 present, report a bug!
     id: 73C89FDB-102_1183 present, report a bug!
     id: 73C89FDB-102_1761 present, report a bug!
     id: 73C89FDB-102_2232 present, report a bug!
     id: 73C89FDB-102_2540 present, report a bug!
     id: 73C89FDB-102_4092 deleted, tombstone exists
     id: 73C89FDB-102_4102 deleted, tombstone exists
     id: 73C89FDB-102_668 deleted, tombstone exists
     id: 87B03DB1-62_3369 present, report a bug!
     id: 8DA39D2B-131_1949 present, report a bug!
     id: 8DA39D2B-131_725 deleted, tombstone exists
     id: A2CC835C-00_2926 present, report a bug!
     id: A2CC835C-00_3022 present, report a bug!
     id: A2CC835C-00_3750 present, report a bug!
     id: A2CC835C-00_5282 present, report a bug!
     id: A2CC835C-00_5312 present, report a bug!
     id: A2CC835C-00_5460 present, report a bug!
     id: A2CC835C-00_6133 present, report a bug!
     id: A2CC835C-00_6641 present, report a bug!
     id: A5C9F867-33_1091 present, report a bug!
     id: A5C9F867-33_1101 present, report a bug!
     id: A5C9F867-33_1673 present, report a bug!
     id: A5C9F867-33_2320 present, report a bug!
     id: A5C9F867-33_2452 present, report a bug!
     id: A5C9F867-33_4010 deleted, tombstone exists
     id: A5C9F867-33_4180 deleted, tombstone exists
     id: CD7B0436-153_3638 present, report a bug!
     id: CD7B0436-153_828 present, report a bug!
     id: D94DA3B2-51_829 present, report a bug!
     id: DE161E9D-40_1235 present, report a bug!
     id: DE161E9D-40_1547 present, report a bug!
     id: DE161E9D-40_2014 present, report a bug!
     id: DE161E9D-40_2184 present, report a bug!
     id: DE161E9D-40_2766 present, report a bug!
     id: DE161E9D-40_3880 deleted, tombstone exists
     id: DE161E9D-40_3910 deleted, tombstone exists
     id: DE161E9D-40_4324 deleted, tombstone exists
     id: DE161E9D-40_4456 deleted, tombstone exists
     id: DE161E9D-40_6801 present, report a bug!
     id: DE161E9D-40_6991 present, report a bug!
     id: DE161E9D-40_7095 present, report a bug!
     id: DE161E9D-40_7105 present, report a bug!
     id: DE161E9D-40_940 present, report a bug!
     id: E9F46ECC-22_173 deleted, tombstone exists
     id: E9F46ECC-22_2883 present, report a bug!
     id: E9F46ECC-22_2913 present, report a bug!
     id: E9F46ECC-22_3017 present, report a bug!
     id: E9F46ECC-22_3187 present, report a bug!
     id: E9F46ECC-22_3765 deleted, tombstone exists
     id: E9F46ECC-22_5327 present, report a bug!
     id: E9F46ECC-22_5455 present, report a bug!
     id: E9F46ECC-22_601 deleted, tombstone exists
     id: E9F46ECC-22_6096 present, report a bug!
     id: E9F46ECC-22_6106 present, report a bug!
     id: E9F46ECC-22_6674 present, report a bug!
     id: E9F46ECC-22_791 present, report a bug!
     id: ECD6BE16-113_2961 present, report a bug!
     id: ECD6BE16-113_3065 present, report a bug!
     id: ECD6BE16-113_3687 present, report a bug!
     id: ECD6BE16-113_3717 present, report a bug!

74 undeleted key(s) present on C2(.54) compared to C1(.44)











 Comments   
Comment by Aruna Piravi [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11805/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11805/C2.tar
Comment by Aruna Piravi [ 25/Jul/14 ]
[7/23/14 1:40:12 PM] Aruna Piraviperumal: hi Mike, I see some backfill stmts like in MB-11725 but that doesn't lead to any missing items
[7/23/14 1:40:13 PM] Aruna Piraviperumal: 172.23.105.47
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
 
172.23.105.50


172.23.105.59


172.23.105.62


172.23.105.45
/opt/couchbase/var/lib/couchbase/logs/memcached.log.27.txt:Tue Jul 22 16:02:46.470085 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-2ad6ab49733cf45595de9ee568c05798 - (vb 421) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.48


172.23.105.52
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.44
[7/23/14 1:56:12 PM] Michael Wiederhold: Having one of those isn't necessarily bad. Let me take a quick look
[7/23/14 2:02:49 PM] Michael Wiederhold: Ok this is good. I'll debug it a little bit more. Also, I don't necessarily expect that data loss will always occur because it's possible that the items could have already been replicated.
[7/23/14 2:03:38 PM] Aruna Piraviperumal: ok
[7/23/14 2:03:50 PM] Aruna Piraviperumal: I'm noticing data loss on standard bucket though
[7/23/14 2:04:19 PM] Aruna Piraviperumal: but no such disk snapshot logs found for 'standardbucket'
Comment by Mike Wiederhold [ 25/Jul/14 ]
For vbucket 0 in the logs I see that on the source side we have high seqno 102957, but on the destination we only have up to seqno 97705 so it appears that some items were not sent to the remote side. I also see in the logs that xdcr did request those items as shown in the log messages below.

memcached<0.78.0>: Wed Jul 23 12:30:02.506513 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 95291 and end seqno 0
memcached<0.78.0>: Wed Jul 23 13:30:01.683760 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) stream created with start seqno 95291 and end seqno 102957
memcached<0.78.0>: Wed Jul 23 13:30:02.070134 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) Stream closing, 0 items sent from disk, 7666 items sent from memory, 102957 was last seqno sent
[ns_server:info,2014-07-23T13:30:10.753,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Wed Jul 23 13:30:10.552586 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 102957 and end seqno 0
Comment by Mike Wiederhold [ 25/Jul/14 ]
Alk,

See my comments above. Can you verify that all items were sent by the xdcr module correctly?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Let me quickly note that .tar is again in fact .tar.gz.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
missing:

a) data files (so that I can double-check your finding)

b) xdcr traces
Comment by Aruna Piravi [ 25/Jul/14 ]
1. For system tests, the data files are huge; I did not attach them, but the cluster is available.
2. xdcr traces were not enabled for this run, my apologies. But do we discard all the info we have in hand? Another complete run will take 3 days, and I'm not sure we want to delay the investigation for that long.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
There's no way to investigate such a delicate issue without having at least traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
If all files are large you can at least attach that vbucket 0 where you found discrepancies.
Comment by Aruna Piravi [ 25/Jul/14 ]
> There's no way to investigate such a delicate issue without having at least traces.
If it is that important, it probably makes sense to enable traces by default rather than having to do it via diag/eval? Customer logs are not going to have traces by default.

>If all files are large you can at least attach that vbucket 0 where you found discrepancies.
 I can, if requested. The cluster was left available anyway.

Fine, let me do another run if there's no way to work around not having traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
>> > There's no way to investigate such a delicate issue without having at least traces.

>> If it is that important, it probably makes sense to enable traces by default rather than having to do it via diag/eval? Customer logs are not going to have traces by default.

Not possible. We log potentially critical information. But _your_ tests are all semi-automated, right? So for your automation it indeed makes sense to always enable xdcr tracing.
Comment by Aruna Piravi [ 25/Jul/14 ]
The system test is completely automated; only the post-test verification is not. Enabling tracing is now part of the framework.




[MB-9710] count of connections a little misleading Created: 10/Dec/13  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Minor
Reporter: Perry Krug Assignee: Perry Krug
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged

 Description   
With the latest 2.5 (build 994) I'm looking at the number of connections as shown in the UI.

At the moment, it is showing me 21 connections per server. Yet when I look at the underlying statistics, I see some misleading numbers:
bucket_conns: 21
curr_connections: 22
curr_conns_on_port_11209: 18
curr_conns_on_port_11210: 2
daemon_connections: 4

And via netstat on port 11210, there are 0 shown, and 34 on port 11209.

I'm opening this bug both for the ep-engine/memcached team to help explain which is most accurate, and then for the UI to pick that up.

 Comments   
Comment by Perry Krug [ 10/Dec/13 ]
Netstat for 11210 and 11209:
[root@ip-10-197-21-14 ~]# netstat -anp | grep 11210
tcp 0 0 0.0.0.0:11210 0.0.0.0:* LISTEN 3881/memcached
udp 0 0 0.0.0.0:11210 0.0.0.0:* 3881/memcached
[root@ip-10-197-21-14 ~]# netstat -anp | grep 11209
tcp 0 0 0.0.0.0:11209 0.0.0.0:* LISTEN 3881/memcached
tcp 0 0 127.0.0.1:51910 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.197.21.14:42786 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:11209 127.0.0.1:37926 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:57303 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:11209 127.0.0.1:41525 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:11209 127.0.0.1:39789 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:44641 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:34026 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:34556 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:39789 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:54895 10.196.82.15:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:37926 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:53357 10.196.82.15:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:44641 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:41957 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:41525 127.0.0.1:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:42786 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.197.21.14:48985 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:48985 10.197.21.14:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 127.0.0.1:11209 127.0.0.1:34026 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:41957 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:34556 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.196.82.15:59784 ESTABLISHED 3881/memcached
tcp 0 0 127.0.0.1:11209 127.0.0.1:51910 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.196.82.15:39580 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.197.21.14:57303 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:38091 10.196.76.2:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.198.10.70:48994 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.198.10.70:56929 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:46674 10.198.10.70:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:53288 10.196.76.2:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:51081 10.198.10.70:11209 ESTABLISHED 3832/beam.smp
tcp 0 0 10.197.21.14:11209 10.196.76.2:35293 ESTABLISHED 3881/memcached
tcp 0 0 10.197.21.14:11209 10.196.76.2:51339 ESTABLISHED 3881/memcached
udp 0 0 0.0.0.0:11209 0.0.0.0:* 3881/memcached
[root@ip-10-197-21-14 ~]#

The UI is reporting 21 on this node. The main question here is being able to explain to customers the correlation between a) their client connections, b) the output of netstat, and c) the metrics reported in the UI.
Comment by Cihan Biyikoglu [ 06/Jun/14 ]
Downgrading to minor given it isn't a critical stat. Keeping on 3.0 for now, but we should only consider this later in the 3.0 cycle.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Trond Norbye [ 22/Jul/14 ]
bucket_conns is refcount - 1 and has the following description:

    /* count of connections + 1 for hashtable reference + number of
     * reserved connections for this bucket + number of temporary
     * references created by find_bucket & frieds.
     *
     * count of connections is count of engine_specific instances
     * having peh equal to this engine_handle. There's only one
     * exception which is connections for which on_disconnect callback
     * was called but which are kept alive by reserved > 0. For those
     * connections we drop refcount in on_disconnect but keep peh
     * field so that bucket_engine_release_cookie can decrement peh
     * refcount.
     *
     * Handle itself can be freed when this drops to zero. This can
     * only happen when bucket is deleted (but can happen later
     * because some connection can hold pointer longer) */

curr_connections is the total number of connection objects in use, and then you have a breakdown per endpoint. daemon_conns is the number of connection objects used for "listening" tasks; we have 4 here, which I would guess is ipv4 and ipv6 for the endpoints...

I'm not absolutely sure why there is a mismatch between the sum of these and curr_connections. They are counted differently, so there are a number of sane explanations. We reduce the per-port count immediately when we initiate a disconnect for a connection, but the aggregated number is decremented only when the connection is completely closed (it may wait for an event in the engine it is connected to, etc.). Another reason they may differ is if the OS reports an error when closing the socket (!eintr && !eagain); in that case we'll have a zombie connection...

A better question is probably: what is the UI trying to show you ;-)




[MB-11772] Provide the facility to release free memory back to the OS from running mcd process Created: 21/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
On many occasions we have seen tcmalloc being very "greedy" with free memory and not releasing it back to the OS very quickly. There have even been occasions where this has triggered the Linux OOM-killer due to the memcached process having too much "free" tcmalloc memory still resident.

tcmalloc by design will /slowly/ return memory back to the OS - via madvise(DONT_NEED) - but this rate is very conservative, and it can only be changed currently by modifying an environment variable, which obviously cannot be done on a running process.

To help mitigate these problems in the future, it would be very helpful to allow the user to request that free memory is released back to the OS.
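
At the tcmalloc level, such a facility essentially wraps a single gperftools call; a minimal sketch (illustrative only, not the change linked in the comments below):

    #include <gperftools/malloc_extension.h>

    // Illustrative sketch: ask tcmalloc to return all unused pages to the OS
    // (via madvise). This is the call an "emergency release" request would wrap.
    void releaseFreeMemoryToOS() {
        MallocExtension::instance()->ReleaseFreeMemory();
    }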


 Comments   
Comment by Dave Rigby [ 21/Jul/14 ]
http://review.couchbase.org/#/c/39608/
Comment by Aleksey Kondratenko [ 30/Jul/14 ]
tcmalloc madvise settings can be changed on a running process. And there's also https://code.google.com/p/gperftools/source/detail?r=a92fc76f72318f7a46e91d9ef6dd24f2bcf44802 as of gperftools 2.2
Comment by Dave Rigby [ 31/Jul/14 ]
@Alk: Could you elaborate on how tcmalloc's madvise behaviour can be changed on a running process? I see nothing about it in the tcmalloc docs [1].

Re: TCMALLOC_AGGRESSIVE_DECOMMIT - that's interesting, and I'll look to benchmark that setting vs. normal tcmalloc and jemalloc (for possible policy changes for MB-10496). Note that the intent of this MB is simply to give us an "emergency button" which can be quickly (and relatively safely) implemented without making performance-sensitive changes to our memory allocation strategy.

[1]: http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html
Comment by Aleksey Kondratenko [ 31/Jul/14 ]
Both the release rate and aggressive decommit can be changed at runtime via the MallocExtension singleton.
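
For reference, a minimal sketch of tuning these at runtime through gperftools' MallocExtension (the aggressive-decommit property name is my assumption for gperftools >= 2.2; treat it as illustrative):

    #include <gperftools/malloc_extension.h>

    void tuneTcmallocAtRuntime() {
        // Make tcmalloc return freed pages to the OS more eagerly
        // (default rate is ~1.0; higher values release faster).
        MallocExtension::instance()->SetMemoryReleaseRate(10.0);

        // Aggressive decommit (gperftools >= 2.2). Property name assumed here;
        // it may differ between gperftools versions.
        MallocExtension::instance()->SetNumericProperty(
            "tcmalloc.aggressive_memory_decommit", 1);
    }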




[MB-11781] [Incremental offline xdcr upgrade] 2.0.1-170-rel - 3.0.0-973-rel, replica counts are not correct Created: 22/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Sangharsh Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Upgrade from 2.0.1-170 - 3.0.0-973

Ubuntu 12.04 TLS

Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/2f193298/10.3.3.218-7212014-740-diag.zip
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/86678a9d/10.3.3.218-diag.txt.gz
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/a272b793/10.3.3.218-7212014-734-couch.tar.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/46622b23/10.3.3.240-7212014-734-couch.tar.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/6da39af1/10.3.3.240-diag.txt.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/702bdaa2/10.3.3.240-7212014-738-diag.zip

[Destination]

10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/ae25e869/10.3.3.225-diag.txt.gz
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/f44d13a3/10.3.3.225-7212014-734-couch.tar.gz
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/f88f7912/10.3.3.225-7212014-743-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/98090b83/10.3.3.239-7212014-734-couch.tar.gz
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/ddb9b54c/10.3.3.239-7212014-741-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/e3ac7b07/10.3.3.239-diag.txt.gz
Is this a Regression?: Unknown

 Description   
[Jenkins]
http://qa.hq.northscale.net/job/ubuntu_x64--36_01--XDCR_upgrade-P1/24/consoleFull

[Test]
/testrunner -i ubuntu_x64--36_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.incremental_offline_upgrade,initial_version=2.0.1-170-rel,sdata=False,bucket_topology=default:1>2;bucket0:1><2,upgrade_seq=src><dest


[Test Steps]
1. Installed Source (2 nodes) and Destination (2 nodes) with 2.0.1-170-rel.
2. Change XDCR global settings: xdcrFailureRestartInterval=1, xdcrCheckpointInterval=60 on both cluster.
3. Setup Remote clusters (Bidirectional).

bucket0 <--> bucket0 (Bi-directional) 10.3.3.240 <---> 10.3.3.239
default ---> default (Uni-directional) 10.3.3.240 -----> 10.3.3.239

4. Load 1000 items on each bucket on Source cluster.
5. Load 1000 items on bucket0 on destination cluster.
6. Wait for replication to finish.
7. Offline upgrade each node one by one to 3.0.0-973, loading 1000 items on bucket0 and default at the Source cluster along the way.
8. Verify items on each side.

Expected items: bucket0 = 6000 and default = 5000


[2014-07-21 09:46:45,612] - [task:463] INFO - Saw vb_active_curr_items 5000 == 5000 expected on '10.3.3.239:8091''10.3.3.225:8091',default bucket
[2014-07-21 09:46:45,628] - [data_helper:289] INFO - creating direct client 10.3.3.239:11210 default
[2014-07-21 09:46:45,732] - [data_helper:289] INFO - creating direct client 10.3.3.225:11210 default
[2014-07-21 09:46:45,811] - [task:463] INFO - Saw vb_replica_curr_items 5000 == 5000 expected on '10.3.3.239:8091''10.3.3.225:8091',default bucket
[2014-07-21 09:46:50,832] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:46:55,852] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:00,872] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:05,892] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:10,912] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:15,933] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:20,954] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:25,974] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:30,995] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:36,018] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:41,040] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:46,062] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:51,085] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:56,106] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:01,128] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:06,150] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:11,173] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket


 Comments   
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th
Comment by Sriram Ganesan [ 30/Jul/14 ]
I ran this test on build 3.0.0-1064 and the test passed successfully on CentOS 5.8 VMs. This test was originally run on ubuntu 12.04 VMs. Would it be possible to verify this test on the latest build and check if this is still a problem?
Comment by Sangharsh Agarwal [ 30/Jul/14 ]
Test passed in the latest build, i.e. 1035.
Comment by Sangharsh Agarwal [ 30/Jul/14 ]
Bug seems to be fixed in build 1035.

http://qa.hq.northscale.net/job/ubuntu_x64--36_01--XDCR_upgrade-P1/27/consoleFull




[MB-10434] Automated tests in gerrit seem to fail frequently due to failure to connect to github Created: 11/Mar/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We can get around this by keeping our own mirror of github locally. Another option would be to reschedule jobs when we cannot connect to gerrit.

Note that this happens pretty frequently.

http://factory.couchbase.com/job/ep-engine-gerrit-300/273/console

 Comments   
Comment by Wayne Siu [ 21/Jul/14 ]
Mike,
Since May, we have upgraded our firewall and internet connection. Do you still see this issue?
Comment by Mike Wiederhold [ 21/Jul/14 ]
No, but we also no longer run automated tests.




[MB-11689] [cache metadata]: No indication of what percentage of metadata is in RAM Created: 11/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-07-11 at 11.32.16.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
For the new Cache Metadata feature we don't appear to give the user any indication of how much metadata has been flushed out to disk.

See attached screenshot - while we do show the absolute amount of RAM used for metadata, there doesn't seem to be any indication of how much of the total is still in RAM.

Note: I had a brief look at the available stats (https://github.com/membase/ep-engine/blob/master/docs/stats.org) and couldn't see a stat for the total metadata size (flushed to disk), so this may also need ep-engine changes if there isn't an underlying stat for this.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Anil Kumar [ 29/Jul/14 ]
Triage : Anil, Wayne .. July 29th

David - Let us know if you're planning to fix it by 3.0 RC, or if we should move this out to 3.0.1.
Comment by David Liao [ 30/Jul/14 ]
I added a new stat, ep_meta_data_disk:
http://www.couchbase.com/issues/browse/MB-11689
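
Assuming ep_meta_data_memory reports metadata bytes resident in RAM and the new ep_meta_data_disk reports metadata bytes that have been ejected to disk (my reading, not confirmed by the change), the UI could derive the requested percentage along these lines:

    // Sketch only: assumes ep_meta_data_memory = metadata bytes resident in RAM
    // and ep_meta_data_disk = metadata bytes ejected to disk.
    static double metadataResidentPercent(double metaInRam, double metaOnDisk) {
        double total = metaInRam + metaOnDisk;
        return total > 0 ? (100.0 * metaInRam) / total : 100.0;
    }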




[MB-11822] numWorkers setting of 5 is treated as high priority but should be treated as low priority. Created: 25/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Sundar Sridharan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
https://github.com/couchbase/ep-engine/blob/master/src/workload.h#L44-48
We currently use the priority conversion formula seen in the code snippet linked above.
This assigns a numWorkers setting of 5 high priority, but the expectation is that <= 5 is low priority.
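
In other words, the expected mapping is a simple boundary check; a minimal sketch of the intended behaviour (names are illustrative, not the actual workload.h code):

    // Sketch of the intended mapping only -- not the actual src/workload.h code.
    enum bucket_priority_t { LOW_BUCKET_PRIORITY, HIGH_BUCKET_PRIORITY };

    static bucket_priority_t priorityFromNumWorkers(int numWorkers) {
        // numWorkers <= 5 should be treated as low priority;
        // only settings above 5 should map to high priority.
        return (numWorkers <= 5) ? LOW_BUCKET_PRIORITY : HIGH_BUCKET_PRIORITY;
    }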

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix uploaded for review at http://review.couchbase.org/39891. Thanks.




[MB-11559] Memcached segfault right after initial cluster setup (master builds) Created: 26/Jun/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: bug-backlog
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: couchbase-server-enterprise_centos6_x86_64_0.0.0-1564-rel.rpm

Attachments: Zip Archive 000-1564.zip     Text File gdb.log    
Issue Links:
Duplicate
is duplicated by MB-11562 memcached crash with segmentation fau... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Comments   
Comment by Dave Rigby [ 28/Jun/14 ]
This is caused by some of the changes added (on the 3.0.1 branch) by MB-11067. Fix incoming (probably Monday).
Comment by Dave Rigby [ 30/Jun/14 ]
http://review.couchbase.org/#/c/38968/

Note: depends on refactor of stats code: http://review.couchbase.org/#/c/38967




[MB-11785] mcd aborted in bucket_engine_release_cookie: "es != ((void *)0)" Created: 22/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 64 vb cluster_run -n1

Attachments: Zip Archive collectinfo-2014-07-22T192534-n_0@127.0.0.1.zip    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Observed while running pyupr unit tests against latest from rel-3.0.0 branch.

After about 20 tests the crash occurred on test_failover_log_n_producers_n_vbuckets. This test passes standalone, so I think it's a matter of running all the tests in succession and then hitting this issue.

backtrace:

Thread 228 (Thread 0x7fed2e7fc700 (LWP 695)):
#0 0x00007fed8b608f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed8b60c388 in __GI_abort () at abort.c:89
#2 0x00007fed8b601e36 in __assert_fail_base (fmt=0x7fed8b753718 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fed8949f28c "es != ((void *)0)",
    file=file@entry=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=line@entry=3301,
    function=function@entry=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:92
#3 0x00007fed8b601ee2 in __GI___assert_fail (assertion=0x7fed8949f28c "es != ((void *)0)",
    file=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=3301,
    function=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:101
#4 0x00007fed8949d13d in bucket_engine_release_cookie (cookie=0x5b422e0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:3301
#5 0x00007fed8835343f in EventuallyPersistentEngine::releaseCookie (this=0x7fed4808f5d0, cookie=0x5b422e0)
    at /couchbase/ep-engine/src/ep_engine.cc:1883
#6 0x00007fed8838d730 in ConnHandler::releaseReference (this=0x7fed7c0544e0, force=false)
    at /couchbase/ep-engine/src/tapconnection.cc:306
#7 0x00007fed883a4de6 in UprConnMap::shutdownAllConnections (this=0x7fed4806e4e0)
    at /couchbase/ep-engine/src/tapconnmap.cc:1004
#8 0x00007fed88353e0a in EventuallyPersistentEngine::destroy (this=0x7fed4808f5d0, force=true)
    at /couchbase/ep-engine/src/ep_engine.cc:2034
#9 0x00007fed8834dc05 in EvpDestroy (handle=0x7fed4808f5d0, force=true) at /couchbase/ep-engine/src/ep_engine.cc:142
#10 0x00007fed89498a54 in engine_shutdown_thread (arg=0x7fed48080540)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1564
#11 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480a5b60) at /couchbase/platform/src/cb_pthreads.c:19
#12 0x00007fed8beba182 in start_thread (arg=0x7fed2e7fc700) at pthread_create.c:312
#13 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 226 (Thread 0x7fed71790700 (LWP 693)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093e80, mutex=0x7fed78093e48, ms=720)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78093e40, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78093e40, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78093e40, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78093e40, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480203e0) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71790700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 225 (Thread 0x7fed71f91700 (LWP 692)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093830, mutex=0x7fed780937f8, ms=86390052)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780937f0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780937f0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780937f0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780937f0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801d490) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71f91700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 224 (Thread 0x7fed72792700 (LWP 691)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3894)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801a670) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed72792700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 223 (Thread 0x7fed70f8f700 (LWP 690)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3893)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed48017850) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed70f8f700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 222 (Thread 0x7fed7078e700 (LWP 689)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1672)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b8e90) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed7078e700) at pthread_create.c:312
---Type <return> to continue, or q <return> to quit---
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 221 (Thread 0x7fed0effd700 (LWP 688)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1673)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b6890) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed0effd700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 210 (Thread 0x7fed0f7fe700 (LWP 661)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed740e8910)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed740667e0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 201 (Thread 0x7fed0ffff700 (LWP 644)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed74135070)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed74050ef0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

---Type <return> to continue, or q <return> to quit---
Thread 192 (Thread 0x7fed2cff9700 (LWP 627)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7c90)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c078340) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2cff9700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 183 (Thread 0x7fed2d7fa700 (LWP 610)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009e000)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5009dfe0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2d7fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 174 (Thread 0x7fed2dffb700 (LWP 593)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009dc30)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed50031010) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2dffb700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 165 (Thread 0x7fed2f7fe700 (LWP 576)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed481cef20)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480921c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 147 (Thread 0x7fed2effd700 (LWP 541)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
---Type <return> to continue, or q <return> to quit---
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed540015d0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54057b80) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2effd700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 138 (Thread 0x7fed6df89700 (LWP 523)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed78092aa0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78056ea0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6df89700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 120 (Thread 0x7fed2ffff700 (LWP 489)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7d10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c1b7ac0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 111 (Thread 0x7fed6cf87700 (LWP 472)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5008c030)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500adf50) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6cf87700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 102 (Thread 0x7fed6d788700 (LWP 455)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
---Type <return> to continue, or q <return> to quit---
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080450)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54091560) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6d788700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 93 (Thread 0x7fed6ff8d700 (LWP 438)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080ad0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54068db0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ff8d700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 57 (Thread 0x7fed6e78a700 (LWP 370)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50080230)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5008c360) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6e78a700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 48 (Thread 0x7fed6ef8b700 (LWP 352)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50000c10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500815b0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ef8b700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 39 (Thread 0x7fed6f78c700 (LWP 334)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed4807c290)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
---Type <return> to continue, or q <return> to quit---
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4806e4c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6f78c700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 13 (Thread 0x7fed817fa700 (LWP 292)):
#0 0x00007fed8b693d7d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b6c5334 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2 0x00007fed88386dd2 in updateStatsThread (arg=0x7fed780343f0) at /couchbase/ep-engine/src/memory_tracker.cc:36
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78034450) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed817fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 10 (Thread 0x7fed8aec4700 (LWP 116)):
#0 0x00007fed8b6be6bd in read () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b64d4e0 in _IO_new_file_underflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at fileops.c:613
#2 0x00007fed8b64e46e in __GI__IO_default_uflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at genops.c:435
#3 0x00007fed8b642184 in __GI__IO_getline_info (fp=0x7fed8b992640 <_IO_2_1_stdin_>, buf=0x7fed8aec3e40 "", n=79, delim=10,
    extract_delim=1, eof=0x0) at iogetline.c:69
#4 0x00007fed8b641106 in _IO_fgets (buf=0x7fed8aec3e40 "", n=0, fp=0x7fed8b992640 <_IO_2_1_stdin_>) at iofgets.c:56
#5 0x00007fed8aec5b24 in check_stdin_thread (arg=0x41c0ee <shutdown_server>)
    at /couchbase/memcached/extensions/daemon/stdin_check.c:38
#6 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a66250) at /couchbase/platform/src/cb_pthreads.c:19
#7 0x00007fed8beba182 in start_thread (arg=0x7fed8aec4700) at pthread_create.c:312
#8 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 9 (Thread 0x7fed89ea3700 (LWP 117)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed8a6c3280 <cond>, mutex=0x7fed8a6c3240 <mutex>, ms=19000)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8a4c0fea in logger_thead_main (arg=0x1a66fe0) at /couchbase/memcached/extensions/loggers/file_logger.c:372
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a67050) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed89ea3700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fed89494700 (LWP 135)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9cb0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd0f0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed89494700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7fed88c93700 (LWP 136)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9da0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd240) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed88c93700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fed83fff700 (LWP 137)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9e90) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd390) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed83fff700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fed837fe700 (LWP 138)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9f80) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd4e0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed837fe700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fed82ffd700 (LWP 139)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca070) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd630) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed82ffd700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fed827fc700 (LWP 140)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca160) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd780) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed827fc700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fed81ffb700 (LWP 141)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca250) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd8d0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed81ffb700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fed8d764780 (LWP 113)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041d24e in main (argc=3, argv=0x7fff77aaa838) at /couchbase/memcached/daemon/memcached.c:8797

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Abhinav,

The backtrace indicates that the abort was caused by closing all the UPR connections during shutdown, an area in which we recently made some fixes.
Comment by Abhinav Dangeti [ 24/Jul/14 ]
Tommie, can you tell me how to run these tests so I can try reproducing this on my system?
Comment by Tommie McAfee [ 24/Jul/14 ]
Start a cluster_run node, then:

git clone https://github.com/couchbaselabs/pyupr.git
cd pyupr
./pyupr -h 127.0.0.1:9000 -b dev


Note that all the tests may pass even though memcached silently aborts in the background.
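For what it's worth, a minimal watcher sketch (plain shell, not part of pyupr; the one-second interval and the pgrep usage are assumptions) that makes such a silent abort visible while the tests run:

# Hypothetical helper: poll the oldest memcached PID and report when it changes,
# which indicates the process died and was restarted (or disappeared entirely).
prev=$(pgrep -o memcached)
while sleep 1; do
  cur=$(pgrep -o memcached)
  if [ "$cur" != "$prev" ]; then
    echo "memcached PID changed: ${prev:-<none>} -> ${cur:-<not running>} (possible silent abort)"
    prev=$cur
  fi
done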
Comment by Abhinav Dangeti [ 24/Jul/14 ]
1. Server side: if a UPR producer or consumer already exists for that cookie, the engine should return DISCONNECT: http://review.couchbase.org/#/c/39843
2. py-upr: in the test test_failover_log_n_producers_n_vbuckets, you are essentially opening 1 connection and sending 1024 open-connection messages, so many tests will need changes.
Comment by Chiyoung Seo [ 24/Jul/14 ]
Tommie,

The server side fix was merged.

Can you please fix the issue in the test script and retest it?
Comment by Tommie McAfee [ 25/Jul/14 ]
Thanks, it's working now and the affected tests pass with this patch:

http://review.couchbase.org/#/c/39878/1




[MB-11721] DCP set vbucket state messages should be sent and processed as soon as possible Created: 14/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Mike Wiederhold [ 14/Jul/14 ]
http://review.couchbase.org/#/c/39375/
http://review.couchbase.org/#/c/39378/
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-11383] warmup_min_items_threshold setting is not honored correctly in 3.0 warmup. Created: 10/Jun/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Venu Uppalapati Assignee: Venu Uppalapati
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Steps to reproduce:

1) On a 3.0 node, create the default bucket and load 10,000 items using cbworkloadgen.
2) Run the following at the command line:
curl -XPOST -u Administrator:password -d 'ns_bucket:update_bucket_props("default", [{extra_config_string, "warmup_min_items_threshold=1"}]).' http://127.0.0.1:8091/diag/eval
3) Restart the node for the setting to take effect, then restart again so that warmup runs with the setting.
4) Run ./cbstats localhost:11210 raw warmup:
ep_warmup_estimated_key_count: 10000
ep_warmup_value_count: 1115
5) Repeating the same steps on a 2.5.1 node gives:
ep_warmup_estimated_key_count: 10000
ep_warmup_value_count: 101
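
For reference, a minimal shell sketch of the reproduction above (assumptions: a single local node, the default Administrator/password credentials, cbworkloadgen/cbstats on the PATH, and that -i is the cbworkloadgen item-count flag):

# Step 1: load 10,000 items into the default bucket.
./cbworkloadgen -n 127.0.0.1:8091 -u Administrator -p password -i 10000
# Step 2: lower the warmup threshold via diag/eval (same command as above).
curl -XPOST -u Administrator:password -d 'ns_bucket:update_bucket_props("default", [{extra_config_string, "warmup_min_items_threshold=1"}]).' http://127.0.0.1:8091/diag/eval
# Steps 3-4: after the two restarts, poll the warmup stats until they stop changing.
watch -n 5 './cbstats localhost:11210 raw warmup | grep -E "ep_warmup_(estimated_key_count|value_count)"'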

 Comments   
Comment by Abhinav Dangeti [ 08/Jul/14 ]
Likely because of parallelization. Could you tell me the time taken for warmup in the same scenario on 2.5.1 and 3.0.0?
Comment by Abhinav Dangeti [ 30/Jul/14 ]
Fix: http://review.couchbase.org/#/c/40062/




[MB-11434] 600-800% CPU consumption by memcached on source cluster. Created: 16/Jun/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Sundar Sridharan
Resolution: Duplicate Votes: 0
Labels: performance, releasenote
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File memcached_cpu.png     PNG File memcached_cpu_toy.png     PNG File memcached_cpu-vs-disk_write_queue.png    
Issue Links:
Relates to
relates to MB-11405 Shared thread pool: high CPU overhead... In Progress
relates to MB-11435 1500-2000% CPU utilization by beam.sm... Resolved
relates to MB-11738 Evaluate GIO CPU utilization on syste... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/290/artifact/
Is this a Regression?: Yes

 Description   
In XDCR scenarios, CPU usage by the memcached process is more than twice what it was in the previous release. This is due to increased scheduling overhead from the shared thread pool.
Workaround: reduce the number of threads on systems that have more than 30 cores (see the sketch below).

5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec/node, LAN
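
For illustration only, a sketch of what such a workaround could look like, reusing the extra_config_string mechanism shown in MB-11383 above; the max_num_readers/max_num_writers parameter names and the bucket name are assumptions, not confirmed settings for this build, and the node must be restarted for the change to take effect:

# Hypothetical: cap reader/writer threads for one bucket on a >30-core box
# (parameter names are assumed; verify against the build before using).
curl -XPOST -u Administrator:password -d 'ns_bucket:update_bucket_props("bucket-1", [{extra_config_string, "max_num_readers=8;max_num_writers=4"}]).' http://127.0.0.1:8091/diag/eval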

 Comments   
Comment by Pavel Paulau [ 26/Jun/14 ]
I have tried the same workload but without XDCR: CPU utilization is 300-400% (70-80% in 2.5.1).
Essentially there is a huge overhead that is not related to XDCR.

Also, in MB-11435 we tried to slow down replication. It does help with Erlang CPU utilization, but memcached consumption never drops below 500%.
Comment by Sundar Sridharan [ 13/Jul/14 ]
Fix for distributed sleep added for review here: http://review.couchbase.org/#/c/39210/ Thanks.
Comment by Sundar Sridharan [ 14/Jul/14 ]
Hi Pavel, could you please check the CPU usage with the toy build RPM couchbase-server-community_cent58-3.0.0-toy-sundar-x86_64_3.0.0-703-toy.rpm? Thanks.
Comment by Pavel Paulau [ 16/Jul/14 ]
CPU utilization with the toy build: 460-480%
CPU utilization with the regular build: 510-520%

Logs:
http://ci.sc.couchbase.com/view/lab/job/perf-dev/499/artifact/

No major impact on MB-11405.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Sundar Sridharan [ 18/Jul/14 ]
Dynamically configurable thread limits fix uploaded for review: http://review.couchbase.org/#/c/39475/
It is expected to mitigate heavy CPU usage and allow tunable testing.
Comment by Pavel Paulau [ 20/Jul/14 ]
Recent changes introduced a new pattern in CPU utilization:

http://cbmonitor.sc.couchbase.com/media/atlas_c1_300-988_7a5_accessatlas_c1_300-988_7a5172.23.100.18memcached_cpu.png

You can see that the utilization rate is about 200-250%, but there are many periods when it increases to 500-600%.

Logs: http://ci.sc.couchbase.com/job/xdcr-5x5/381/artifact/

Pretty much the same situation with MB-11405.
Comment by Chiyoung Seo [ 21/Jul/14 ]
Pavel,

I don't worry too much about the higher CPU usage compared with 2.5.x. As you know, we have more memcached worker threads and more IO threads, which are now capped at 8 writers and 16 readers. As long as they don't spend significant time on OS-level context switches, it should be totally okay.
Comment by Pavel Paulau [ 22/Jul/14 ]
Yes, it's totally fine to consume a lot of CPU resources; MB-11435 is a good example.

The main problem right now is the regression in persistence speed (MB-11769 / MB-11731).

The attached chart demonstrates an obvious correlation between CPU utilization and the disk write queue. Let's address those issues first.

In any case, wasting 60-70% of the time on context switching won't be acceptable.
Comment by Pavel Paulau [ 22/Jul/14 ]
Closing as a duplicate of MB-11405 in order to avoid noise and confusion.




[MB-11794] Creating 10 buckets causes memcached segmentation fault Created: 23/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-998

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/396/artifact/
Is this a Regression?: Yes

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Sundar,

The backtrace indicates that this is most likely a regression from the recent vbucket-level lock change for the flusher, vbucket snapshot, compaction, and vbucket deletion tasks.
Comment by Pavel Paulau [ 23/Jul/14 ]
The same issue happened with a single bucket. The problem seems rather common.
Comment by Sundar Sridharan [ 24/Jul/14 ]
Found the root cause: cachedVBStates is not preallocated and is modified in a thread-unsafe manner. This regression shows up now because we have more parallelism with vbucket-level locking. Working on the fix.
Comment by Sundar Sridharan [ 24/Jul/14 ]
Fix uploaded for review at http://review.couchbase.org/#/c/39834/ Thanks.
Comment by Chiyoung Seo [ 24/Jul/14 ]
The fix was merged.




[MB-11797] Rebalance-out hangs during Rebalance + Views operation in DGM run Created: 23/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel

Attachments: Text File logs.txt    
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Link:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_02--view_query_extended-P1/145/consoleFull

Test to Reproduce:
./testrunner -i /tmp/ubuntu12-view6node.ini get-delays=True,get-cbcollect-info=True -t view.createdeleteview.CreateDeleteViewTests.incremental_rebalance_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction

Steps to Reproduce:
1. Set up a 5-node cluster
2. Create the default bucket
3. Load 200,000 items
4. Load the bucket further to reach 10% active resident ratio (DGM)
5. Create views
6. Start ddoc operations and rebalance-out in parallel

Please refer to the attached log file "logs.txt".

Uploading Logs:


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/8586d8eb/172.23.106.201-7222014-2350-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/ea5d5a3f/172.23.106.199-7222014-2354-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d06d7861/172.23.106.200-7222014-2355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/65653f65/172.23.106.198-7222014-2353-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/dd05a054/172.23.106.197-7222014-2352-diag.zip
Comment by Sriram Melkote [ 23/Jul/14 ]
Nimish - to my eyes, it looks like views are not involved in this failure. Can you please take a look at the detailed log and assign to Alk if you agree? Thanks
Comment by Nimish Gupta [ 23/Jul/14 ]
From the logs:

[couchdb:info,2014-07-22T14:47:21.345,ns_1@172.23.106.199:<0.17993.2>:couch_log:info:39]Set view `default`, replica (prod) group `_design/dev_ddoc40`, signature `c018b62ae9eab43522a3d0c43ac48b3e`, terminating with reason: {upr_died,
                                                                                                                                       {bad_return_value,
                                                                                                                                        {stop,
                                                                                                                                         sasl_auth_failed}}}

One obvious problem is that we returned the wrong number of parameters in the stop tuple when SASL auth failed. I have fixed that, and the change is under review (http://review.couchbase.org/#/c/39735/).

I don't know why SASL auth failed; it may be normal for SASL auth to fail during rebalance. Meenakshi, could you please run the test again after this change is merged?
Comment by Nimish Gupta [ 23/Jul/14 ]
Trond has added code to log more information for SASL errors in memcached (http://review.couchbase.org/#/c/39738/). It will be helpful for debugging SASL errors.
Comment by Meenakshi Goel [ 24/Jul/14 ]
Issue is reproducible with latest build 3.0.0-1020-rel.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/99/consoleFull
Uploading Logs shortly.
Comment by Meenakshi Goel [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/13f68e9c/172.23.106.186-7242014-1238-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/c0cf8496/172.23.106.187-7242014-1239-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/77b2fb50/172.23.106.188-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d0335545/172.23.106.189-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/7634b520/172.23.106.190-7242014-1241-diag.zip
Comment by Nimish Gupta [ 24/Jul/14 ]
From the ns_server logs, it looks to me like memcached has crashed.

[error_logger:error,2014-07-24T12:28:36.305,ns_1@172.23.106.186:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_memcached:init/1
    pid: <0.693.0>
    registered_name: []
    exception exit: {badmatch,{error,closed}}
      in function gen_server:init_it/6 (gen_server.erl, line 328)
    ancestors: ['single_bucket_sup-default',<0.675.0>]
    messages: []
    links: [<0.717.0>,<0.719.0>,<0.720.0>,<0.277.0>,<0.676.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 75113
    stack_size: 27
    reductions: 26397931
  neighbours:

Ep-engine/ns_server team, please take a look.
Comment by Nimish Gupta [ 24/Jul/14 ]
From the logs:

** Reason for termination ==
** {unexpected_exit,
       {'EXIT',<0.31044.9>,
           {{{badmatch,{error,closed}},
             {gen_server,call,
                 ['ns_memcached-default',
                  {get_dcp_docs_estimate,321,
                      "replication:ns_1@172.23.106.187->ns_1@172.23.106.188:default"},
                  180000]}},
            {gen_server,call,
                [{'janitor_agent-default','ns_1@172.23.106.187'},
                 {if_rebalance,<0.15733.9>,
                     {wait_dcp_data_move,['ns_1@172.23.106.188'],321}},
                 infinity]}}}}
Comment by Sriram Melkote [ 25/Jul/14 ]
Alk, can you please take a look? Thanks!
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Quick hint for fellow coworkers: when you see a closed connection, usually the first thing to check is whether memcached has crashed. And in this case it indeed has (the diag's cluster-wide logs are the perfect place to find such issues):

2014-07-24 12:28:35.861 ns_log:0:info:message(ns_1@172.23.106.186) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:09:47.941525 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 650) stream created with start seqno 5794 and end seqno 18446744073709551615
Thu Jul 24 12:09:49.115570 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 749, cookie 0x606f800
Thu Jul 24 12:09:49.380310 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 648, cookie 0x6070d00
Thu Jul 24 12:09:49.450869 PDT 3: (default) UPR (Consumer) eq_uprq:replication:ns_1@172.23.106.189->ns_1@172.23.106.186:default - (vb 648) Attempting to add takeover stream with start seqno 5463, end seqno 18446744073709551615, vbucket uuid 35529072769610, snap start seqno 5463, and snap end seqno 5463
Thu Jul 24 12:09:49.495674 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 648) stream created with start seqno 5463 and end seqno 18446744073709551615
2014-07-24 12:28:36.302 ns_memcached:0:info:message(ns_1@172.23.106.186) - Control connection to memcached on 'ns_1@172.23.106.186' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_memcached:0:info:message(ns_1@172.23.106.187) - Control connection to memcached on 'ns_1@172.23.106.187' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_log:0:info:message(ns_1@172.23.106.187) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:28:35.860224 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1019) Stream closing, 0 items sent from disk, 0 items sent from memory, 5781 was last seqno sent
Thu Jul 24 12:28:35.860235 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1020) Stream closing, 0 items sent from disk, 0 items sent from memory, 5879 was last seqno sent
Thu Jul 24 12:28:35.860246 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1021) Stream closing, 0 items sent from disk, 0 items sent from memory, 5772 was last seqno sent
Thu Jul 24 12:28:35.860256 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1022) Stream closing, 0 items sent from disk, 0 items sent from memory, 5427 was last seqno sent
Thu Jul 24 12:28:35.860266 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1023) Stream closing, 0 items sent from disk, 0 items sent from memory, 5480 was last seqno sent

Status 137 is 128 (death by signal, set by the kernel) + 9, i.e. signal 9 (SIGKILL). dmesg (captured in couchbase.log) shows no signs of OOM. That means a human did it :) Not the first and sadly not the last time something like this happens: rogue scripts, bad tests, etc.
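As a quick illustration of the arithmetic above, a shell snippet (nothing Couchbase-specific assumed) that maps a port-server exit status back to a signal name:

# 137 - 128 = 9; bash's kill -l prints the name of signal 9.
status=137
kill -l $((status - 128))   # prints: KILL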
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Also, we should stop the practice of reusing tickets for unrelated conditions. This doesn't look anywhere close to a rebalance hang, does it?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Not sure what to do about this one. Closing as incomplete will probably not hurt.




[MB-11326] [memcached] Function call argument is an uninitialised value in upr_stream_req_executor Created: 05/Jun/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Dave Rigby Assignee: Trond Norbye
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File report-3a6911.html    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Bug reported by the clang static analyzer.

Description: Function call argument is an uninitialized value
File: /Users/dave/repos/couchbase/server/source/memcached/daemon/memcached.c upr_stream_req_executor()
Line: 4242

See attached report.

From speaking to Trond offline, he believes it shouldn't be possible to enter upr_stream_req_executor() with c->aiostat == ENGINE_ROLLBACK (which is what triggers this error), in which case we should just add a suitable assert() to squash the warning.

 Comments   
Comment by Dave Rigby [ 20/Jun/14 ]
http://review.couchbase.org/#/c/38560/
Comment by Wayne Siu [ 08/Jul/14 ]
Hi Trond,
The patchset is ready for review.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-11796] Rebalance after manual failover hangs (delta recovery) Created: 23/Jul/14  Updated: 31/Jul/14  Resolved: 31/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified