[MB-11825] Rebalance may fail if cluster_compat_mode:is_node_compatible times out waiting for ns_doctor:get_node Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: customer, rebalance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Is this a Regression?: No

 Description   
Saw this in CBSE-1301:

 <0.2025.3344> exited with {{function_clause,
                            [{new_ns_replicas_builder,handle_info,
                              [{#Ref<0.0.4447.107509>,
                                [stale,
                                 {last_heard,{1406,315410,842219}},
                                 {now,{1406,315410,210848}},
                                 {active_buckets,
                                  ["user_reg","sentence","telemetry","common",
                                   "notifications","playlists","users"]},
                                 {ready_buckets,

which caused rebalance to fail.

The reason is that new_ns_replicas_builder doesn't have the catch-all handle_info clause that's typical for gen_servers. The message arrives because of the following call chain:

* new_ns_replicas_builder:init/1

* ns_replicas_builder_utils:spawn_replica_builder/5

* ebucketmigrator_srv:build_args

* cluster_compat_mode:is_node_compatible

* ns_doctor:get_node

ns_doctor:get_node handles the call timeout and returns an empty list. So when the timeout fires, the actual reply may still be delivered later and end up in handle_info, which in this case is unable to handle it.
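For illustration, here is a minimal gen_server skeleton (an assumed sketch of the typical pattern, not the actual fix linked in the comment below) showing the kind of catch-all handle_info clause that new_ns_replicas_builder was missing. With it, a stale {Ref, Reply} message left over from a timed-out gen_server:call is simply ignored instead of crashing the process with function_clause:

    -module(stale_reply_example).
    -behaviour(gen_server).

    -export([start_link/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link() ->
        gen_server:start_link(?MODULE, [], []).

    init([]) ->
        {ok, []}.

    handle_call(_Request, _From, State) ->
        {reply, ok, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% Catch-all clause. Without it, a late {Ref, Reply} message from a
    %% gen_server:call that already timed out matches no clause and kills
    %% the process with function_clause, as in the backtrace above.
    handle_info(_Msg, State) ->
        {noreply, State}.

    terminate(_Reason, _State) ->
        ok.

    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.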

3.0 is mostly immune to this particular chain of calls thanks to this optimization:

commit 70badff90b03176b357cac4d03e40acc62f4861b
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Tue Oct 1 11:44:02 2013 -0700

    MB-9096: optimized is_node_compatible when cluster is compatible
    
    There's no need to check for particular node's compatibility with
    certain feature if entire cluster's mode is new enough.
    
    Change-Id: I9573e6b2049cb00d2adad709ba41ec5285d66a6b
    Reviewed-on: http://review.couchbase.org/29317
    Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
    Reviewed-by: Artem Stemkovski <artem@couchbase.com>
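For illustration, a hypothetical sketch of the short-circuit this commit describes (module and helper names are invented; this is not the real cluster_compat_mode code). When the whole cluster's compat mode is already new enough, the per-node check that eventually calls ns_doctor:get_node is skipped entirely:

    -module(compat_shortcut_example).
    -export([is_node_compatible/3]).

    %% Versions are assumed to be comparable terms, e.g. {3, 0}.
    is_node_compatible(Node, FeatureVersion, ClusterCompatVersion) ->
        case ClusterCompatVersion >= FeatureVersion of
            true  -> true;  % entire cluster is new enough; skip per-node check
            false -> node_is_compatible(Node, FeatureVersion)
        end.

    %% Hypothetical stand-in for the slow path that consults per-node
    %% status (the ns_doctor:get_node call in the chain above).
    node_is_compatible(_Node, _FeatureVersion) ->
        false.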


 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
http://review.couchbase.org/39908




[MB-11824] [system test] [kv unix] rebalance hangs at 0% when adding a node to the cluster Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.4 64-bit

Attachments: Zip Archive 172.23.107.195-7252014-1342-diag.zip     Zip Archive 172.23.107.196-7252014-1345-diag.zip     Zip Archive 172.23.107.197-7252014-1349-diag.zip     Zip Archive 172.23.107.199-7252014-1352-diag.zip     Zip Archive 172.23.107.200-7252014-1356-diag.zip     Zip Archive 172.23.107.201-7252014-143-diag.zip     Zip Archive 172.23.107.202-7252014-1359-diag.zip     Zip Archive 172.23.107.203-7252014-146-diag.zip    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on 8 nodes
1:172.23.107.195
2:172.23.107.196
3:172.23.107.197
4:172.23.107.199
5:172.23.107.200
6:172.23.107.202
7:172.23.107.201

8:172.23.107.203

Create a cluster of 7 nodes
Create 2 buckets: default and sasl-2 (no view)
Load 25M+ items into each bucket to bring the active resident ratio down to 80%
Do updates, expirations, and deletes on both buckets for 3 hours.
Then add node 203 to the cluster. Rebalance hangs at 0%

Live cluster is available for debugging


 Comments   
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809. We currently have two bug fixes in that address rebalance-stuck issues (MB-11809 and MB-11786). Please run the tests with these changes merged before filing any other rebalance-stuck issues.




[MB-11823] buildbot for ubuntu 10.04 appears to be hung Created: 25/Jul/14  Updated: 26/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
http://builds.hq.northscale.net:8010/builders/ubuntu-1004-x64-300-builder
I saw the build job pending for more than 13 hours

 Comments   
Comment by Chris Hillery [ 26/Jul/14 ]
buildbot master had gotten stuck in an odd state where it didn't hand any builds to that buildslave, even though it was connected and idle. I believe I have unstuck it by temporarily re-assigning that builder to another slave. I've put it back to the original slave now, and will watch to ensure the next build (1077) runs successfully.




[MB-11822] numWorkers setting of 5 is treated as high priority but should be treated as low priority. Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Sundar Sridharan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
https://github.com/couchbase/ep-engine/blob/master/src/workload.h#L44-48
We currently use the priority conversion formula seen in the code linked above.
It assigns a numWorkers setting of 5 high priority, but the expectation is that settings of 5 or less are low priority.
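For illustration only, a hypothetical sketch of the expected mapping (the actual conversion is the C++ in workload.h linked above; the threshold below is the expectation stated in this report, not the current behavior):

    -module(workload_priority_example).
    -export([bucket_priority/1]).

    %% Expected conversion: a numWorkers setting of 5 or less is treated
    %% as low priority; only settings above 5 are high priority.
    bucket_priority(NumWorkers) when NumWorkers =< 5 -> low_priority;
    bucket_priority(NumWorkers) when NumWorkers >  5 -> high_priority.

Under this mapping, bucket_priority(5) yields low_priority, which is the behavior the report expects.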

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/39891 thanks




[MB-11821] Rename UPR to DCP in stats and loggings Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sundar Sridharan Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
ep-engine side changes are http://review.couchbase.org/#/c/39898/ thanks




[MB-11820] beer-sample loading is stuck in crashed state (was: Rebalance not available 'pending add rebalance', beer-sample loading is stuck) Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Anil Kumar Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
4 node cluster
scenario

1 - Created beer-sample right after creation of the cluster
2 - Right after bucket started loading, auto-generated load started running on it
3 - After many many minutes, I added a few nodes and noticed that I couldn't rebalance. Digging in further, I saw that the beer-sample loading was still going on but not making any progress.

Logs are at:
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.196.74.148.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.196.87.131.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.0.243.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.21.69.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.22.57.zip

 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Converting this ticket to "beer-sample loading is stuck". The lack of a rebalance warning is covered by another existing ticket that is still in the works.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Here's what I have in my logs that's output from docloader:

5 matches for "output from beer-sample" in buffer: ns_server.debug.log
  19416:[ns_server:debug,2014-07-24T23:40:07.637,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "[2014-07-24 23:40:07,637] - [rest_client] [47987464387312] - INFO - existing buckets : [u'beer-sample']\n"
  19417:[ns_server:debug,2014-07-24T23:40:07.637,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "[2014-07-24 23:40:07,637] - [rest_client] [47987464387312] - INFO - found bucket beer-sample\n"
  19450:[ns_server:debug,2014-07-24T23:40:10.387,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "Traceback (most recent call last):\n File \"/opt/couchbase/lib/python/cbdocloader\", line 241, in ?\n main()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 233, in main\n"
  19451:[ns_server:debug,2014-07-24T23:40:10.388,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: " docloader.populate_docs()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 191, in populate_docs\n self.unzip_file_and_upload()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 175, in unzip_file_and_upload\n self.enumerate_and_save(working_dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 165, in enumerate_and_save\n self.enumerate_and_save(dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 165, in enumerate_and_save\n self.enumerate_and_save(dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 155, in enumerate_and_save\n self.save_doc(dockey, fp)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 133, in save_doc\n self.bucket.set(dockey, 0, 0, raw_data)\n File \"/opt/couchbase/lib/python/couchbase/client.py\", line 232, in set\n self.mc_client.set(key, expiration, flags, value)\n File \"/opt/couchbase/lib/python/couchbase/couchbaseclient.py\", line 927, in set\n"
  19452:[ns_server:debug,2014-07-24T23:40:10.388,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: " return self._respond(item, event)\n File \"/opt/couchbase/lib/python/couchbase/couchbaseclient.py\", line 883, in _respond\n raise item[\"response\"][\"error\"]\ncouchbase.couchbaseclient.MemcachedError: Memcached error #134: Temporary failure\n"

I don't know if docloader is truly stuck or if it is retrying and getting temporary errors the whole time.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
I don't know who owns docloader, but AFAIK it was Bin. I've also heard about some attempts to rewrite it in Go.

CC-ed a bunch of possibly related folks.




[MB-11819] XDCR: Rebalance at destination hangs, missing replica items Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Duplicate Votes: 0
Labels: rebalance-hang
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 172.23.106.45-7242014-208-diag.zip     Zip Archive 172.23.106.46-7242014-2010-diag.zip     Zip Archive 172.23.106.47-7242014-2011-diag.zip     Zip Archive 172.23.106.48-7242014-2013-diag.zip    
Issue Links:
Duplicate
duplicates MB-11809 {UPR}:: Rebalance-in of 2 nodes is st... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-1014

Scenario
------------
1. Uni-xdcr between 2-node clusters, default bucket
2. Load 30K items on source
3. Pause XDCR
4. Start "rebalance-out" of one node each from both clusters simultaneously.
5. Resume xdcr

Rebalance at source proceeds to completion, rebalance on dest hangs at 10%, see -

[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at cluster 172.23.106.47
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:07,325] - [rest_client:1216] INFO - rebalance percentage : 100 %
[2014-07-24 13:28:30,222] - [task:411] INFO - rebalancing was completed with progress: 100% in 83.475001812 sec
[2014-07-24 13:28:30,223] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:28:30,229] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:40,252] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:50,280] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:00,301] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:10,342] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:20,363] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:30,389] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:40,410] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:50,437] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:00,458] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:10,480] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:20,504] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:30,523] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:40,546] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:50,569] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %

Testcase
--------------
./testrunner -i uni-xdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,items=30000,rdirection=unidirection,ctopology=chain,replication_type=xmem,rebalance_out=source-destination,pause=source,GROUP=P1


Could the rebalance hang explain the missing replica items?

[2014-07-24 13:31:49,079] - [task:463] INFO - Saw curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,103] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,343] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:49,536] - [task:463] INFO - Saw vb_active_curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,559] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,811] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:50,001] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:31:55,045] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:00,080] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:05,113] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

Logs
-------------
will attach cbcollect with xdcr trace logging.

 Comments   
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Do you have _any reason at all_ to believe that it's even remotely related to xdcr? Specifically, xdcr does nothing about upr replicas.
Comment by Aruna Piravi [ 24/Jul/14 ]
I, of course, _do_ know that replicas have nothing to do with xdcr. But I'm unsure whether xdcr and the parallel rebalance contributed to the hang.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
I cannot diagnose a stuck rebalance when the logs are captured after cleanup.
Comment by Aruna Piravi [ 24/Jul/14 ]
And more on why I think so ---

Please note from the logs below that there has been no progress in the rebalance at the destination _from_ the time we resumed xdcr. Until then, it had progressed to 10%.

[2014-07-24 13:26:59,500] - [pauseResumeXDCR:92] INFO - ##### Pausing xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:26:59,541] - [rest_client:1757] INFO - Updated pauseRequested=true on bucket'default' on 172.23.106.45
[2014-07-24 13:26:59,968] - [task:517] WARNING - Not Ready: xdc_ops 1734 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:27:00,145] - [task:521] INFO - Saw replication_docs_rep_queue 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:00,339] - [task:517] WARNING - Not Ready: replication_active_vbreps 16 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091', default bucket
[2014-07-24 13:27:05,490] - [task:521] INFO - Saw xdc_ops 0 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:27:05,697] - [task:521] INFO - Saw replication_active_vbreps 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at source cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at source cluster 172.23.106.47
[2014-07-24 13:27:05,761] - [xdcrbasetests:372] INFO - sleep for 5 secs. ...
[2014-07-24 13:27:06,733] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 13:27:06,746] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,773] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 13:27:06,796] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:10,823] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:27:10,860] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 13:27:11,101] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 894
[2014-07-24 13:27:11,102] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:12,043] - [task:521] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 13:27:12,260] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 869
[2014-07-24 13:27:12,261] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:13,142] - [task:521] INFO - Saw xdc_ops 4770 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
Comment by Aruna Piravi [ 24/Jul/14 ]
Live cluster

http://172.23.106.45:8091/
http://172.23.106.47:8091/ <-- rebalance stuck
Comment by Aruna Piravi [ 24/Jul/14 ]
New logs attached.
Comment by Aruna Piravi [ 24/Jul/14 ]
Didn't try pausing replication from the source cluster. Wanted to leave the cluster in the same state.

.47 started receiving data through resumed xdcr from 20:04:01. The last recorded rebalance progress was 8.7890625 % at 20:04:05 on .47. It could have stopped a few seconds before that.

[2014-07-24 20:03:55,538] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 20:03:55,547] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,569] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 20:03:55,578] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,584] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:55,592] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:59,629] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 20:03:59,665] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 20:03:59,799] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1010
[2014-07-24 20:03:59,800] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:00,803] - [task:523] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 20:04:01,019] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1082
[2014-07-24 20:04:01,020] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:01,877] - [task:523] INFO - Saw xdc_ops 4981 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 20:04:01,888] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 20:04:05,894] - [rest_client:1216] INFO - rebalance percentage : 10.7421875 %
[2014-07-24 20:04:05,905] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:15,927] - [rest_client:1216] INFO - rebalance percentage : 19.53125 %
[2014-07-24 20:04:15,937] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:25,956] - [rest_client:1216] INFO - rebalance percentage : 26.7578125 %
[2014-07-24 20:04:25,964] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:35,995] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 20:04:36,007] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:46,030] - [rest_client:1216] INFO - rebalance percentage : 50.9114583333 %
[2014-07-24 20:04:46,037] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:56,060] - [rest_client:1216] INFO - rebalance percentage : 59.7005208333 %
[2014-07-24 20:04:56,068] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:05:06,087] - [rest_client:1216] INFO - rebalance percentage : 99.9348958333 %
[2014-07-24 20:05:06,096] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Same symptoms as MB-11809:

     {<0.4446.17>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007fdb6c22ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007fdb1022d3a8 Return addr 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.12.179156>">>,<<"y(1) infinity">>,
                   <<"y(2) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.147.17>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007fdb1022d3e0 Return addr 0x00007fdb1b1ed020 (janitor_agent:'-spawn_rebalance_subprocess/3-fun-0-'/3 + 200)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) Catch 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007fdb1022d408 Return addr 0x00007fdb6c2338a0 (proc_lib:init_p/3 + 688)">>,
                   <<"y(0) <0.160.17>">>,<<>>,
                   <<"0x00007fdb1022d418 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007fdb6c2338c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,0}]},
       {heap_size,233},
       {total_heap_size,233},
       {links,[<0.160.17>,<0.186.17>]},
       {memory,2816},
       {message_queue_len,0},
       {reductions,29},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Aruna, consider pausing xdcr. It is likely unrelated to xdcr given the MB- reference above.
Comment by Aruna Piravi [ 25/Jul/14 ]
I paused xdcr last night. No progress on the rebalance yet. Does that rule out xdcr completely?
Comment by Aruna Piravi [ 25/Jul/14 ]
Raising as test blocker. ~10 tests failed due to this rebalance hang problem. Feel free to close if found to be a duplicate of MB-11809.
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809




[MB-11818] couchbase-cli cluster-wide collectinfo failed to start collection on selected nodes Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-1022 on 4 nodes
Run couchbase cli to do cluster-wide collectinfo on one node
The collection failed to start

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149


 Comments   
Comment by Bin Cui [ 25/Jul/14 ]
I am confused. Are you sure you want to use collect-logs-stop to start collecting?
Comment by Thuan Nguyen [ 25/Jul/14 ]
Oops, I copied the wrong command.
Here is the command that failed to start collectinfo:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
Comment by Bin Cui [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39889/




[MB-11817] cluster-wide cli does not print out success or failure when starting log collection Created: 24/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-1022 on one node
Run root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --allnodes
root@ubuntu:~#
Collection start shows in the UI, but the command line prints nothing. I don't know whether it succeeded or failed.





[MB-11816] couchbase-cli failed to collect logs in cluster-wide collection Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.deb.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on one ubuntu 12.04 node
Run cluster-wide collectinfo using couchbase-cli
Failed to collect

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c localhost:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39848/
Comment by Thuan Nguyen [ 25/Jul/14 ]
Verified on build 3.0.0-1028. This bug was fixed.




[MB-11815] Support Ubuntu 14.04 as a supported platform Created: 24/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Anil Kumar Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We need to add support for Ubuntu 14.04.




[MB-11814] Failover decision at bucket level Created: 24/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Parag Agarwal Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Scenario Example

A healthy 4-node cluster has two buckets, one with replica=0 and one with replica=1. When failing over a node, we only get the hard failover option, since we consider the minimum replica count across all buckets to make this choice. To allow graceful failover, we would require at least replica=1 across all buckets.

This can be improved by letting the system decide whether graceful failover of a particular bucket is possible based on that bucket's own replica count, rather than being blocked by another bucket with a lower replica count (which qualifies for hard failover only). Since graceful failover avoids data loss compared to hard failover, this would reduce data-loss situations.
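As an illustration of the proposal, a hypothetical sketch (invented names, not ns_server code) of deciding the failover type per bucket from that bucket's own replica count, rather than from the minimum across all buckets:

    -module(failover_choice_example).
    -export([failover_plan/1]).

    %% Input: [{BucketName, ReplicaCount}].
    %% failover_plan([{"default", 1}, {"sasl-2", 0}]) ->
    %%     [{"default", graceful}, {"sasl-2", hard}]
    failover_plan(Buckets) ->
        [{Name, failover_type(Replicas)} || {Name, Replicas} <- Buckets].

    failover_type(Replicas) when Replicas >= 1 -> graceful;  % no data loss
    failover_type(_)                           -> hard.      % replica = 0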



 Comments   
Comment by Dave Rigby [ 25/Jul/14 ]
Similarly for auto-failover: currently, if you have a bucket with zero replicas, it essentially "blocks" auto-failover of another bucket which does have replicas.




[MB-11813] windows 64-bit buildbot failed to build new 64-bit builds and did not report any error Created: 24/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
The Windows 64-bit buildbot failed to produce a new build:
http://builds.hq.northscale.net:8010/builders/server-30-win-x64-300/builds/411
No errors were reported.

 Comments   
Comment by Thuan Nguyen [ 24/Jul/14 ]
This 64-bit builder shows the build as successful, but no build was actually produced.
Comment by Chris Hillery [ 24/Jul/14 ]
The build isn't performed by buildbot; buildbot only spawns the Jenkins job:

http://factory.couchbase.com/job/cs_300_win6408/

And that job is still ongoing.




[MB-11812] Need a read-only mode to startup the query server Created: 24/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Don Pinto Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This is required for the tutorial in production, as we don't want any user to blow away the data or add additional data.

All DML queries should be blocked when the server is started in this mode. Only the admin should be able to start the query server in read-only mode.






[MB-11811] [Tools] Change UPR to DCP for tools Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39814/




[MB-11810] No feedback (similar to rebalance failures) if installation of sample buckets fails... Created: 24/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Trond Norbye Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
I tried to install the sample buckets on my Windows server; the buckets are created, but there is no data in them when I look at them. I would have expected a "red error stripe", just like I get with, for instance, rebalance errors, so that I know that something went wrong.

When I took a look in the "log" section, I saw what my mum would call a cryptic error message:

Loading sample bucket gamesim-sample failed: {failed_to_load_samples_with_status,
1}

When I tried to run the program cbdocloader with the appropriate arguments I get a more informative error message:

"Failed to listen listen unix /tmp/log_upr_client.sock: Det ble brukt en adresse som var inkompatibel med den forespurte protokollen"

(I'm using my Go version of cbdocloader, which someone just modified to use the retriever logging project, which uses unix sockets, which don't work on Windows.)

It would be nice if we could add the output from the process to the log. It would make it easier to debug the problem for customers (they will probably not know the name and arguments of the binary we tried to use).


 Comments   
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Deserves proper design.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Also please note that we don't capture output of samples loader because:

a) erlang doesn't allow us to read stdout and stderr separately

b) original docloader is quite noisy

Things might indeed change once we have a better loader implementation that can be relied on to output only errors.
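For reference on point (a), a minimal sketch (assumed, not the actual samples_loader_tasks code) of driving an external loader through an Erlang port: stderr can only be merged into stdout with the stderr_to_stdout option; there is no port option that delivers the two streams separately:

    -module(loader_port_example).
    -export([run/2]).

    %% Spawn Exe with Args and collect its merged stdout+stderr until exit.
    run(Exe, Args) ->
        Port = open_port({spawn_executable, Exe},
                         [stderr_to_stdout, exit_status, stream, binary,
                          {args, Args}]),
        collect(Port, <<>>).

    collect(Port, Acc) ->
        receive
            {Port, {data, Data}} ->
                collect(Port, <<Acc/binary, Data/binary>>);
            {Port, {exit_status, Status}} ->
                {Status, Acc}  % exit code plus everything the loader printed
        end.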




[MB-11809] {UPR}:: Rebalance-in of 2 nodes is stuck when doing Ops Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-11819 XDCR: Rebalance at destination hangs,... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Build 1014, CentOS 6.x

Vms:: 10.6.2.144-150

1. Create 7 node cluster
2. Create default bucket
3. Add 400 K items
4. Do mutations and rebalance-out 2 nodes
5. Do mutations and rebalance-in 2 nodes

Step 5 leads to rebalance being stuck

Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceinout.RebalanceInOutTests.incremental_rebalance_out_in_with_mutation,init_num_nodes=3,items=400000,skip_cleanup=True,GROUP=IN_OUT;P0


 Comments   
Comment by Parag Agarwal [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11809/log.tar.gz
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
The takeover request appears to be stuck. That's on node .147.

     {<19779.11046.0>,
      [{registered_name,'replication_manager-default'},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f1b1d12ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f1ad3083860 Return addr 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.0.169038>">>,<<"y(1) infinity">>,
                   <<"y(2) {takeover,78}">>,<<"y(3) '$gen_call'">>,
                   <<"y(4) <0.11353.0>">>,<<"y(5) []">>,<<>>,
                   <<"0x00007f1ad3083898 Return addr 0x00007f1acbd79e70 (replication_manager:handle_call/3 + 2840)">>,
                   <<"y(0) infinity">>,<<"y(1) {takeover,78}">>,
                   <<"y(2) 'upr_replicator-default-ns_1@10.6.2.146'">>,
                   <<"y(3) Catch 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f1ad30838c0 Return addr 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<"y(0) [{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FM\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.149',\"789\"}]">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<>>,
                   <<"0x00007f1ad30838d8 Return addr 0x00007f1b1d133ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) replication_manager">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) <0.11029.0>">>,
                   <<"y(4) {dcp_takeover,'ns_1@10.6.2.146',78}">>,
                   <<"y(5) {<0.11528.0>,#Ref<0.0.0.169027>}">>,
                   <<"y(6) Catch 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<>>,
                   <<"0x00007f1ad3083918 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f1b1d133ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,42}]},
       {heap_size,610},
       {total_heap_size,2208},
       {links,[<19779.11029.0>]},
       {memory,18856},
       {message_queue_len,2},
       {reductions,17287},
       {trap_exit,true}]}
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39894




[MB-11808] GeoSpatial in 3.0 Created: 24/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, UI, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Volker Mische
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We must hide GeoSpatial-related UI elements in the 3.0 release, as we have not completed the task of moving GeoSpatial features over to UPR.

We should use the simplest way to hide the elements (like a "display:none" attribute), because we fully expect to resurface this in 3.0.1.


 Comments   
Comment by Sriram Melkote [ 24/Jul/14 ]
In the 3.0 release meeting, it was fairly clear that we won't be able to add Geo support for 3.0 due to the release being in Beta phase now and heading to code freeze soon. So, we should plan for it in 3.0.1 - updating description to reflect this.




[MB-11807] couchbase server failed to start in ubuntu when upgrading from 2.0 to 3.0 if it could not find the database Created: 23/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 2.0 on one ubuntu 12.04 64-bit node
Initialize it with custom data and index path (/tmp/data and /tmp/index)
Create default bucket
Load 1K items to bucket
Shutdown couchbase server
Remove all files under /tmp/data/ and /tmp/index
Upgrade couchbase server to 3.0.0-995
Couchbase server failed to start because it could not find the database.
After starting Couchbase server manually, it starts normally with no items in the bucket, as expected.

The point here is that Couchbase server should start even if it cannot find the database files.

It may be related to bug MB-7705.




[MB-11806] rebalance should not be allowed when cbrecovery is stopped by REST API or has not completed Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ashvinder Singh Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: ns_server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos, ubuntu

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Found in build-3.0.0-973-release

Setup: Two clusters: src and dst with 3 nodes each. Please have 2 spare nodes
- Setup xdcr between src and dst cluster
- Ensure xdcr is setup and complete
- Hard Failover two nodes from dst cluster
- Verify nodes failover
- Add two spare nodes in dst cluster
- Initiate cbrecovery from src to dst
- stop cbrecovery using REST API
http://10.3.121.106:8091//pools/default/buckets/default/controller/stopRecovery?recovery_uuid=3ad71c7b3365593e0979da34306fb2a5

- initiate rebalance operation on dst cluster.

Observations: the rebalance operation starts.
Expectation: since the rebalance operation is disallowed from the UI when recovery is ongoing (or halted), rebalance should not be allowed from the REST or cli interfaces either.


 Comments   
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
First of all, you're doing it wrong here. The intended use of cbrecovery is to recover the _source_ by using data from the destination.
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
stop recovery is stop recovery. We do allow rebalance in this case by design.
Comment by Andrei Baranouski [ 24/Jul/14 ]
Alk, I do not agree regarding "Expectation: Since rebalance operation is disallowed from UI when recovery is ongoing (or halted). The rebalance should not be allowed from REST or cli interface".
I think we shouldn't have the possibility to trigger it via REST if we can't do it in the UI.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
The steps don't match that "shouldn't". Feel free to file a proper bug for "UI doesn't allow but REST does allow" with all the proper details and evidence.




[MB-11805] KV+ XDCR System test: Missing items in bi-xdcr only Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-998

Clusters
-----------
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/
Free for investigation. Not attaching data files.

Steps
--------
1a. Load on both clusters till vb_active_resident_items_ratio < 50.
1b. Setup bi-xdcr on "standardbucket", uni-xdcr on "standardbucket1"
2. Access phase with 50% gets, 50%deletes for 3 hrs
3. Rebalance-out 1 node at cluster1
4. Rebalance-in 1 node at cluster1
5. Failover and remove node at cluster1
6. Failover and add-back node at cluster1
7. Rebalance-out 1 node at cluster2
8. Rebalance-in 1 node at cluster2
9. Failover and remove node at cluster2
10. Failover and add-back node at cluster2
11. Soft restart all nodes in cluster1 one by one
Verify item count

Problem
-------------
standardbucket(C1) <---> standardbucket(C2)
On C1 - 57890744 items
On C2 - 57957032 items
standardbucket1(C1) ----> standardbucket1(C2)
On C1 - 14053020 items
On C2 - 14053020 items

Total number of missing items : 66,288

Bucket priority
-----------------------
Both standardbucket and standardbucket1 have high priority.


Attached
-------------
cbcollect and list of keys that are missing on vb0


Missing keys
-------------------
At least 50-60 keys are missing in every vbucket. Attaching all missing keys from vb0.

vb0
-------
{'C1_node:': u'172.23.105.44',
'vb': 0,
'C2_node': u'172.23.105.54',
'C1_key_count': 78831,
 'C2_key_count': 78929,
 'missing_keys': 98}

     id: 06FA8A8B-11_110 deleted, tombstone exists
     id: 06FA8A8B-11_1354 present, report a bug!
     id: 06FA8A8B-11_1426 present, report a bug!
     id: 06FA8A8B-11_2175 present, report a bug!
     id: 06FA8A8B-11_2607 present, report a bug!
     id: 06FA8A8B-11_2797 present, report a bug!
     id: 06FA8A8B-11_3871 deleted, tombstone exists
     id: 06FA8A8B-11_4245 deleted, tombstone exists
     id: 06FA8A8B-11_4537 present, report a bug!
     id: 06FA8A8B-11_662 deleted, tombstone exists
     id: 06FA8A8B-11_6960 present, report a bug!
     id: 06FA8A8B-11_7064 present, report a bug!
     id: 3600C830-80_1298 present, report a bug!
     id: 3600C830-80_1308 present, report a bug!
     id: 3600C830-80_2129 present, report a bug!
     id: 3600C830-80_4219 deleted, tombstone exists
     id: 3600C830-80_4389 deleted, tombstone exists
     id: 3600C830-80_7038 present, report a bug!
     id: 3FEF1B93-91_2890 present, report a bug!
     id: 3FEF1B93-91_2900 present, report a bug!
     id: 3FEF1B93-91_3004 present, report a bug!
     id: 3FEF1B93-91_3194 present, report a bug!
     id: 3FEF1B93-91_3776 deleted, tombstone exists
     id: 3FEF1B93-91_753 present, report a bug!
     id: 52D6D916-120_1837 present, report a bug!
     id: 52D6D916-120_3282 present, report a bug!
     id: 52D6D916-120_3312 present, report a bug!
     id: 52D6D916-120_3460 present, report a bug!
     id: 52D6D916-120_376 deleted, tombstone exists
     id: 52D6D916-120_404 deleted, tombstone exists
     id: 52D6D916-120_4926 present, report a bug!
     id: 52D6D916-120_5022 present, report a bug!
     id: 52D6D916-120_5750 present, report a bug!
     id: 52D6D916-120_594 deleted, tombstone exists
     id: 52D6D916-120_6203 present, report a bug!
     id: 5C12B75A-142_2889 present, report a bug!
     id: 5C12B75A-142_2919 present, report a bug!
     id: 5C12B75A-142_569 deleted, tombstone exists
     id: 73C89FDB-102_1013 present, report a bug!
     id: 73C89FDB-102_1183 present, report a bug!
     id: 73C89FDB-102_1761 present, report a bug!
     id: 73C89FDB-102_2232 present, report a bug!
     id: 73C89FDB-102_2540 present, report a bug!
     id: 73C89FDB-102_4092 deleted, tombstone exists
     id: 73C89FDB-102_4102 deleted, tombstone exists
     id: 73C89FDB-102_668 deleted, tombstone exists
     id: 87B03DB1-62_3369 present, report a bug!
     id: 8DA39D2B-131_1949 present, report a bug!
     id: 8DA39D2B-131_725 deleted, tombstone exists
     id: A2CC835C-00_2926 present, report a bug!
     id: A2CC835C-00_3022 present, report a bug!
     id: A2CC835C-00_3750 present, report a bug!
     id: A2CC835C-00_5282 present, report a bug!
     id: A2CC835C-00_5312 present, report a bug!
     id: A2CC835C-00_5460 present, report a bug!
     id: A2CC835C-00_6133 present, report a bug!
     id: A2CC835C-00_6641 present, report a bug!
     id: A5C9F867-33_1091 present, report a bug!
     id: A5C9F867-33_1101 present, report a bug!
     id: A5C9F867-33_1673 present, report a bug!
     id: A5C9F867-33_2320 present, report a bug!
     id: A5C9F867-33_2452 present, report a bug!
     id: A5C9F867-33_4010 deleted, tombstone exists
     id: A5C9F867-33_4180 deleted, tombstone exists
     id: CD7B0436-153_3638 present, report a bug!
     id: CD7B0436-153_828 present, report a bug!
     id: D94DA3B2-51_829 present, report a bug!
     id: DE161E9D-40_1235 present, report a bug!
     id: DE161E9D-40_1547 present, report a bug!
     id: DE161E9D-40_2014 present, report a bug!
     id: DE161E9D-40_2184 present, report a bug!
     id: DE161E9D-40_2766 present, report a bug!
     id: DE161E9D-40_3880 deleted, tombstone exists
     id: DE161E9D-40_3910 deleted, tombstone exists
     id: DE161E9D-40_4324 deleted, tombstone exists
     id: DE161E9D-40_4456 deleted, tombstone exists
     id: DE161E9D-40_6801 present, report a bug!
     id: DE161E9D-40_6991 present, report a bug!
     id: DE161E9D-40_7095 present, report a bug!
     id: DE161E9D-40_7105 present, report a bug!
     id: DE161E9D-40_940 present, report a bug!
     id: E9F46ECC-22_173 deleted, tombstone exists
     id: E9F46ECC-22_2883 present, report a bug!
     id: E9F46ECC-22_2913 present, report a bug!
     id: E9F46ECC-22_3017 present, report a bug!
     id: E9F46ECC-22_3187 present, report a bug!
     id: E9F46ECC-22_3765 deleted, tombstone exists
     id: E9F46ECC-22_5327 present, report a bug!
     id: E9F46ECC-22_5455 present, report a bug!
     id: E9F46ECC-22_601 deleted, tombstone exists
     id: E9F46ECC-22_6096 present, report a bug!
     id: E9F46ECC-22_6106 present, report a bug!
     id: E9F46ECC-22_6674 present, report a bug!
     id: E9F46ECC-22_791 present, report a bug!
     id: ECD6BE16-113_2961 present, report a bug!
     id: ECD6BE16-113_3065 present, report a bug!
     id: ECD6BE16-113_3687 present, report a bug!
     id: ECD6BE16-113_3717 present, report a bug!

74 undeleted key(s) present on C2(.54) compared to C1(.44)











 Comments   
Comment by Aruna Piravi [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11805/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11805/C2.tar
Comment by Aruna Piravi [ 25/Jul/14 ]
[7/23/14 1:40:12 PM] Aruna Piraviperumal: hi Mike, I see some backfill stmts like in MB-11725 but that doesn't lead to any missing items
[7/23/14 1:40:13 PM] Aruna Piraviperumal: 172.23.105.47
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
 
172.23.105.50


172.23.105.59


172.23.105.62


172.23.105.45
/opt/couchbase/var/lib/couchbase/logs/memcached.log.27.txt:Tue Jul 22 16:02:46.470085 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-2ad6ab49733cf45595de9ee568c05798 - (vb 421) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.48


172.23.105.52
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.44
[7/23/14 1:56:12 PM] Michael Wiederhold: Having one of those isn't necessarily bad. Let me take a quick look
[7/23/14 2:02:49 PM] Michael Wiederhold: Ok this is good. I'll debug it a little bit more. Also, I don't necessarily expect that data loss will always occur because it's possible that the items could have already been replicated.
[7/23/14 2:03:38 PM] Aruna Piraviperumal: ok
[7/23/14 2:03:50 PM] Aruna Piraviperumal: I'm noticing data loss on standard bucket though
[7/23/14 2:04:19 PM] Aruna Piraviperumal: but no such disk snapshot logs found for 'standardbucket'
Comment by Mike Wiederhold [ 25/Jul/14 ]
For vbucket 0 in the logs I see that on the source side we have high seqno 102957, but on the destination we only have up to seqno 97705 so it appears that some items were not sent to the remote side. I also see in the logs that xdcr did request those items as shown in the log messages below.

memcached<0.78.0>: Wed Jul 23 12:30:02.506513 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 95291 and end seqno 0
memcached<0.78.0>: Wed Jul 23 13:30:01.683760 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) stream created with start seqno 95291 and end seqno 102957
memcached<0.78.0>: Wed Jul 23 13:30:02.070134 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) Stream closing, 0 items sent from disk, 7666 items sent from memory, 102957 was last seqno sent
[ns_server:info,2014-07-23T13:30:10.753,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Wed Jul 23 13:30:10.552586 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 102957 and end seqno 0
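The comparison above is per-vbucket: whenever the destination's high seqno trails the source's, items are missing. A minimal sketch of that check, assuming the high seqnos have already been collected into dicts (for example from cbstats vbucket-seqno output); vbucket 0 below reuses the numbers from the comment, the rest are hypothetical:

# Hypothetical pre-collected high seqnos per vbucket; vb 0 matches the
# numbers quoted above (source 102957 vs destination 97705).
src_high = {0: 102957, 1: 51234}
dst_high = {0: 97705, 1: 51234}

for vb in sorted(src_high):
    src, dst = src_high[vb], dst_high.get(vb, 0)
    if dst < src:
        print("vb %d: destination trails by %d seqnos (%d < %d)"
              % (vb, src - dst, dst, src))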
Comment by Mike Wiederhold [ 25/Jul/14 ]
Alk,

See my comments above. Can you verify that all items were sent by the xdcr module correctly?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Let me quickly note that .tar is again in fact .tar.gz.
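A small Python sketch for handling that transparently: tarfile's "r:*" mode sniffs the compression from the content instead of trusting the extension (the file name below is just an example):

import tarfile

# "r:*" auto-detects gzip/bz2 compression, so a file named C1.tar that is
# really a .tar.gz still opens correctly.
archive = tarfile.open("C1.tar", "r:*")
for member in archive.getmembers()[:10]:
    print(member.name)
archive.close()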
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
missing:

a) data files (so that I can double-check your finding)

b) xdcr traces
Comment by Aruna Piravi [ 25/Jul/14 ]
1. For system tests, data files are huge, so I did not attach them; the cluster is available.
2. xdcr traces were not enabled for this run, my apologies, but do we discard all the info we have in hand? Another complete run will take 3 days. I'm not sure we want to delay the investigation for that long.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
There's no way to investigate such a delicate issue without having at least traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
If all the files are large, you can at least attach the data file for that vbucket 0 where you found discrepancies.
Comment by Aruna Piravi [ 25/Jul/14 ]
> There's no way to investigate such a delicate issue without having at least traces.
If it is that important, it probably makes sense to enable traces by default rather than having to do diag/eval? Customer logs are not going to have traces by default.

>If all the files are large, you can at least attach the data file for that vbucket 0 where you found discrepancies.
I can, if requested. The cluster was left available anyway.

Fine, let me do another run if there's no way to work around not having traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
>> > There's no way to investigate such a delicate issue without having at least traces.

>> If it is that important, it probably makes sense to enable traces by default rather than having to do diag/eval? Customer logs are not going to have traces by default.

Not possible. We log potentially critical information. But _your_ tests are all semi-automated, right? So for your automation it makes sense indeed to always enable xdcr tracing.
Comment by Aruna Piravi [ 25/Jul/14 ]
System test is completely automated. Only the post-test verification is not. But enabling tracing is now a part of the framework.
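For the record, since the framework now enables tracing: toggling a setting like this is typically one authenticated POST to ns_server's /diag/eval endpoint mentioned above. A minimal sketch; the host is hypothetical and the Erlang expression is a placeholder, not the verified tracing setting:

import base64, urllib2

host = "http://172.23.105.44:8091"   # hypothetical node
expr = "ok."                         # placeholder Erlang expression

req = urllib2.Request(host + "/diag/eval", data=expr)
req.add_header("Authorization",
               "Basic " + base64.b64encode("Administrator:password"))
print(urllib2.urlopen(req).read())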




[MB-11804] [Windows] Memcached error #132 'Internal error': Internal error for vbucket... when set key to bucket Created: 23/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: Zip Archive 172.23.107.124-7232014-1631-diag.zip     Zip Archive 172.23.107.125-7232014-1633-diag.zip     Zip Archive 172.23.107.126-7232014-1634-diag.zip     Zip Archive 172.23.107.127-7232014-1635-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build from centos build. http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-999-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Ran the warmup test in build 3.0.0-999 on 4 Windows 2008 R2 64-bit nodes:
python testrunner.py -i ../../ini/4-w-sanity-new.ini -t warmupcluster.WarmUpClusterTest.test_warmUpCluster,num_of_docs=100

The test failed when it loaded keys into the default bucket. The same test passed on both CentOS 6.4 and Ubuntu 12.04 64-bit.





[MB-11803] {UPR}:: Rebalance-out failing due to bad replicators Created: 23/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.5.2.13
10.5.2.14
10.5.2.15
10.3.121.63
10.3.121.64
10.3.121.66
10.3.121.69

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Tested on builds 1011 and 1005; both Ubuntu and CentOS are seeing this issue.

1. Create a 7 node cluster
2. Create a default bucket
3. Add 100 K items
4. Rebalance-out 1 Node (10.3.121.69)
5. Do Ops for Gets

Step 4 and Step 5 act in parallel.

Rebalance exits with the following error:

Bad replicators after rebalance:
Missing = [{'ns_1@10.3.121.63','ns_1@10.3.121.64',0},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',1},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',2},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',3},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',56},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',4},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',5},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',6},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',57},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',58},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',19},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',20},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',21},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',22},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',59},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',23},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',24},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',25},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',60},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',26},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',29},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',30},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',31},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',61},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',38},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',39},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',40},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',62},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',41},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',42},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',43},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',63},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',44},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',47},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',48},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',49},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',64},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',65},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',74},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',75},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',76},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',66},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',77},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',78},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',79},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',67},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',80},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',81},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',82},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',83},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',68},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',92},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',93},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',94},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',69},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',95},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',96},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',97},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',71},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',110},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',111},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',112},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',72},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',113},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',114},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',115},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',73},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',116},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',117},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',118}]
Extras = []

Test Case:: ./testrunner -i centos.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_get_random_key,nodes_out=1,items=100000,value_size=256,skip_cleanup=True,GROUP=OUT;BASIC;P0;FROM_2_0

Will attach logs asap

 Comments   
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
This is relatively easily reproducible on cluster_run. I'm seeing upr disconnects which explain bad_replicas.

Might be duplicate of another upr disconnects bug.
Comment by Parag Agarwal [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11803/logs.tar.gz
Comment by Mike Wiederhold [ 23/Jul/14 ]
http://review.couchbase.org/#/c/39760/
Comment by Parag Agarwal [ 23/Jul/14 ]
Does not repro in 1014




[MB-11802] [BUG BASH] Sample Bug Created: 23/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Don Pinto Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Sample test bug for bug bash - Ignore




[MB-11801] It takes almost 2x more time to rebalance 10 empty buckets Created: 23/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-881

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File reb_empty.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/400/artifact/
Is this a Regression?: Yes

 Description   
Rebalance-in, 3 -> 4, 10 empty buckets

There was only one change:
http://review.couchbase.org/#/c/34501/




[MB-11800] cbworkloadgen failed to run in rhel 6.5 Created: 23/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.0.1, 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Cédric Delgehier Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Red Hat Enterprise Linux Server release 6.5 (Santiago)
kernel 2.6.32-431.20.3.el6.x86_64

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
After installing Couchbase,

I tried cbworkloadgen, but I get an error:

{noformat}
[root@rhel65_64~]# /opt/couchbase/lib/python/cbworkloadgen --version
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/couchstore.py", line 29, in <module>
    _lib = CDLL("libcouchstore-1.dll")
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory
[root@rhel65_64~]# /opt/couchbase/lib/python/cbworkloadgen -n localhost:8091
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/couchstore.py", line 29, in <module>
    _lib = CDLL("libcouchstore-1.dll")
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory
{noformat}

Versions tested:
couchbase-server-2.0.1-170.x86_64
couchbase-server-2.5.1-1083.x86_64

 Comments   
Comment by Bin Cui [ 23/Jul/14 ]
First, please check if libcouchstore.so is under /opt/couchbase/lib. If yes, please check whether the following Python script runs correctly:

import traceback, sys   # needed for the error path below
import ctypes

# Try the platform-specific library names in order until one loads.
for lib in ('libcouchstore.so', # Linux
            'libcouchstore.dylib', # Mac OS
            'couchstore.dll', # Windows
            'libcouchstore-1.dll'): # Windows (pre-CMake)
    try:
        _lib = ctypes.CDLL(lib)
        break
    except OSError, err:
        continue
else:
    # No candidate loaded; report the last failure and bail out.
    traceback.print_exc()
    sys.exit(1)
Comment by Bin Cui [ 23/Jul/14 ]
The problem is possibly caused by wrong permissions on the ctypes module.

http://review.couchbase.org/#/c/39764/
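If permissions are the suspect, a quick way to audit is to walk the install tree and flag files the invoking user cannot read. A minimal sketch (the path is the default install location mentioned in this ticket):

import os, stat

# Flag files that are not world-readable; a root-owned .pyc with mode
# -rw------- would break imports for everyone else.
root = "/opt/couchbase/lib/python"
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        mode = os.stat(path).st_mode
        if not mode & stat.S_IROTH:
            print("not world-readable: %s (%s)"
                  % (path, oct(stat.S_IMODE(mode))))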
Comment by Cédric Delgehier [ 23/Jul/14 ]
[root@rhel65_64 ~]# ls -al /opt/couchbase/lib/libcouchstore.so
lrwxrwxrwx 1 bin bin 22 Jul 22 14:51 /opt/couchbase/lib/libcouchstore.so -> libcouchstore.so.1.0.0

---

[root@rhel65_64 ~]# cat test.py
#!/usr/bin/env python
# -*-python-*-

import traceback, sys
import ctypes
for lib in ('libcouchstore.so', # Linux
            'libcouchstore.dylib', # Mac OS
            'couchstore.dll', # Windows
            'libcouchstore-1.dll'): # Windows (pre-CMake)
    try:
        _lib = ctypes.CDLL(lib)
        break
    except OSError, err:
        continue
else:
    traceback.print_exc()
    sys.exit(1)

[root@rhel65_64 ~]# python test.py
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    _lib = ctypes.CDLL(lib)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory

---

[root@rhel65_64 ~]# python -c "import sys; print sys.version_info[1]"
6

---

[root@rhel65_64~]# ls -ald /opt/couchbase/lib/python/pysqlite2
drwx---r-x 3 1001 1001 4096 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2

[root@rhel65_64~]# ls -al /opt/couchbase/lib/python/pysqlite2/*
-rw----r-- 1 1001 1001 2624 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/dbapi2.py
-rw------- 1 root root 2684 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2/dbapi2.pyc
-rw----r-- 1 1001 1001 2350 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/dump.py
-rw----r-- 1 1001 1001 1020 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/__init__.py
-rw------- 1 root root 134 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2/__init__.pyc
-rwx---r-- 1 1001 1001 1253220 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/_sqlite.so

/opt/couchbase/lib/python/pysqlite2/test:
total 120
drwx---r-- 3 1001 1001 4096 Jul 22 14:52 .
drwx---r-x 3 1001 1001 4096 Jul 23 11:02 ..
-rw----r-- 1 1001 1001 29886 Jul 22 14:52 dbapi.py
-rw----r-- 1 1001 1001 1753 Jul 22 14:52 dump.py
-rw----r-- 1 1001 1001 7942 Jul 22 14:52 factory.py
-rw----r-- 1 1001 1001 6569 Jul 22 14:52 hooks.py
-rw----r-- 1 1001 1001 1966 Jul 22 14:52 __init__.py
drwx---r-- 2 1001 1001 4096 Jul 22 14:52 py25
-rw----r-- 1 1001 1001 10443 Jul 22 14:52 regression.py
-rw----r-- 1 1001 1001 7356 Jul 22 14:52 transactions.py
-rw----r-- 1 1001 1001 15200 Jul 22 14:52 types.py
-rw----r-- 1 1001 1001 13217 Jul 22 14:52 userfunctions.py

---

[root@rhel65_64~]# ls -ald /opt/couchbase/lib/python/pysnappy2_24
ls: cannot access /opt/couchbase/lib/python/pysnappy2_24: No such file or directory
[root@rhel65_64~]# locate pysnappy
[root@rhel65_64~]#

---

As an indication, for version 4:

[root@rhel65_64~]# ls -al /usr/lib64/python2.6/lib-dynload/_ctypes.so
-rwxr-xr-x 1 root root 123608 Nov 21 2013 /usr/lib64/python2.6/lib-dynload/_ctypes.so
[root@rhel65_64~]# ls -ald /usr/lib64/python2.6/ctypes/
drwxr-xr-x. 3 root root 4096 Jul 9 19:52 /usr/lib64/python2.6/ctypes/
[root@rhel65_64~]# ls -ald /usr/lib64/python2.6/ctypes/*
-rw-r--r-- 1 root root 2041 Nov 22 2010 /usr/lib64/python2.6/ctypes/_endian.py
-rw-r--r-- 2 root root 2286 Nov 21 2013 /usr/lib64/python2.6/ctypes/_endian.pyc
-rw-r--r-- 2 root root 2286 Nov 21 2013 /usr/lib64/python2.6/ctypes/_endian.pyo
-rw-r--r-- 1 root root 17004 Nov 22 2010 /usr/lib64/python2.6/ctypes/__init__.py
-rw-r--r-- 2 root root 19936 Nov 21 2013 /usr/lib64/python2.6/ctypes/__init__.pyc
-rw-r--r-- 2 root root 19936 Nov 21 2013 /usr/lib64/python2.6/ctypes/__init__.pyo
drwxr-xr-x. 2 root root 4096 Jul 9 19:52 /usr/lib64/python2.6/ctypes/macholib
-rw-r--r-- 1 root root 8531 Nov 22 2010 /usr/lib64/python2.6/ctypes/util.py
-rw-r--r-- 1 root root 8376 Mar 20 2010 /usr/lib64/python2.6/ctypes/util.py.binutils-no-dep
-rw-r--r-- 2 root root 7493 Nov 21 2013 /usr/lib64/python2.6/ctypes/util.pyc
-rw-r--r-- 2 root root 7493 Nov 21 2013 /usr/lib64/python2.6/ctypes/util.pyo
-rw-r--r-- 1 root root 5349 Nov 22 2010 /usr/lib64/python2.6/ctypes/wintypes.py
-rw-r--r-- 2 root root 5959 Nov 21 2013 /usr/lib64/python2.6/ctypes/wintypes.pyc
-rw-r--r-- 2 root root 5959 Nov 21 2013 /usr/lib64/python2.6/ctypes/wintypes.pyo



Comment by Bin Cui [ 24/Jul/14 ]
Check if we support rhel 6.5 or not
Comment by Cédric Delgehier [ 24/Jul/14 ]
http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#supported-platforms
Comment by Cédric Delgehier [ 25/Jul/14 ]
So if I understand what's implied, you're telling me to roll back the security patches to version 6.3, is that it?




[MB-11799] Bucket compaction causes massive slowness of UPR consumers Created: 23/Jul/14  Updated: 26/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_b1-vs-compaction_b2-vs-ep_upr_replica_items_remaining-vs_xdcr_lag.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/386/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Similar to MB-11731, which is getting worse and worse. But now compaction affects intra-cluster replication and XDCR latency as well:

"ep_upr_replica_items_remaining" reaches 1M during compaction
"xdcr latency" reaches 5 minutes during compaction.

See attached charts for details. Full reports:

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1005_a66_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1005_6d2_access

One important change that we made recently - http://review.couchbase.org/#/c/39647/.

The last known working build is 3.0.0-988.

 Comments   
Comment by Pavel Paulau [ 23/Jul/14 ]
Chiyoung,

This is a really critical regression. It affects many XDCR tests and also blocks many investigation/tuning efforts.
Comment by Sundar Sridharan [ 25/Jul/14 ]
fix added for review at http://review.couchbase.org/39880 thanks
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue:

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.




[MB-11797] Rebalance-out hangs during Rebalance + Views operation in DGM run Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel

Attachments: Text File logs.txt    
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Link:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_02--view_query_extended-P1/145/consoleFull

Test to Reproduce:
./testrunner -i /tmp/ubuntu12-view6node.ini get-delays=True,get-cbcollect-info=True -t view.createdeleteview.CreateDeleteViewTests.incremental_rebalance_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction

Steps to Reproduce:
1. Setup 5-node cluster
2. Create default bucket
3. Load 200000 items
4. Load bucket to achieve dgm 10%
5. Create Views
6. Start ddoc + Rebalance out operations in parallel

Please refer attached log file "logs.txt".

Uploading Logs:


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/8586d8eb/172.23.106.201-7222014-2350-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/ea5d5a3f/172.23.106.199-7222014-2354-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d06d7861/172.23.106.200-7222014-2355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/65653f65/172.23.106.198-7222014-2353-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/dd05a054/172.23.106.197-7222014-2352-diag.zip
Comment by Sriram Melkote [ 23/Jul/14 ]
Nimish - to my eyes, it looks like views are not involved in this failure. Can you please take a look at the detailed log and assign to Alk if you agree? Thanks
Comment by Nimish Gupta [ 23/Jul/14 ]
From the logs:

[couchdb:info,2014-07-22T14:47:21.345,ns_1@172.23.106.199:<0.17993.2>:couch_log:info:39]Set view `default`, replica (prod) group `_design/dev_ddoc40`, signature `c018b62ae9eab43522a3d0c43ac48b3e`, terminating with reason: {upr_died,
                                                                                                                                       {bad_return_value,
                                                                                                                                        {stop,
                                                                                                                                         sasl_auth_failed}}}

One obvious problem is that we returned the wrong number of parameters for stop when sasl auth failed. I have fixed that, and the fix is under review (http://review.couchbase.org/#/c/39735/).

I don't know why sasl auth failed; it may be normal for sasl auth to fail during rebalance. Meenakshi, could you please run the test again after this change is merged?
Comment by Nimish Gupta [ 23/Jul/14 ]
Trond has added code to log more information for sasl errors in memcached (http://review.couchbase.org/#/c/39738/). It will be helpful to debug sasl errors.
Comment by Meenakshi Goel [ 24/Jul/14 ]
The issue is reproducible with the latest build, 3.0.0-1020-rel.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/99/consoleFull
Uploading Logs shortly.
Comment by Meenakshi Goel [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/13f68e9c/172.23.106.186-7242014-1238-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/c0cf8496/172.23.106.187-7242014-1239-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/77b2fb50/172.23.106.188-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d0335545/172.23.106.189-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/7634b520/172.23.106.190-7242014-1241-diag.zip
Comment by Nimish Gupta [ 24/Jul/14 ]
From the ns_server logs, it looks to me like memcached has crashed.

[error_logger:error,2014-07-24T12:28:36.305,ns_1@172.23.106.186:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_memcached:init/1
    pid: <0.693.0>
    registered_name: []
    exception exit: {badmatch,{error,closed}}
      in function gen_server:init_it/6 (gen_server.erl, line 328)
    ancestors: ['single_bucket_sup-default',<0.675.0>]
    messages: []
    links: [<0.717.0>,<0.719.0>,<0.720.0>,<0.277.0>,<0.676.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 75113
    stack_size: 27
    reductions: 26397931
  neighbours:

Ep-engine/ns_server team, please take a look.
Comment by Nimish Gupta [ 24/Jul/14 ]
From the logs:

** Reason for termination ==
** {unexpected_exit,
       {'EXIT',<0.31044.9>,
           {{{badmatch,{error,closed}},
             {gen_server,call,
                 ['ns_memcached-default',
                  {get_dcp_docs_estimate,321,
                      "replication:ns_1@172.23.106.187->ns_1@172.23.106.188:default"},
                  180000]}},
            {gen_server,call,
                [{'janitor_agent-default','ns_1@172.23.106.187'},
                 {if_rebalance,<0.15733.9>,
                     {wait_dcp_data_move,['ns_1@172.23.106.188'],321}},
                 infinity]}}}}
Comment by Sriram Melkote [ 25/Jul/14 ]
Alk, can you please take a look? Thanks!
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Quick hint for fellow coworkers: when you see a connection closed, usually the first thing to check is whether memcached has crashed. And in this case it indeed has (diag's cluster-wide logs are the perfect place to find these issues):

2014-07-24 12:28:35.861 ns_log:0:info:message(ns_1@172.23.106.186) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:09:47.941525 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 650) stream created with start seqno 5794 and end seqno 18446744073709551615
Thu Jul 24 12:09:49.115570 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 749, cookie 0x606f800
Thu Jul 24 12:09:49.380310 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 648, cookie 0x6070d00
Thu Jul 24 12:09:49.450869 PDT 3: (default) UPR (Consumer) eq_uprq:replication:ns_1@172.23.106.189->ns_1@172.23.106.186:default - (vb 648) Attempting to add takeover stream with start seqno 5463, end seqno 18446744073709551615, vbucket uuid 35529072769610, snap start seqno 5463, and snap end seqno 5463
Thu Jul 24 12:09:49.495674 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 648) stream created with start seqno 5463 and end seqno 18446744073709551615
2014-07-24 12:28:36.302 ns_memcached:0:info:message(ns_1@172.23.106.186) - Control connection to memcached on 'ns_1@172.23.106.186' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_memcached:0:info:message(ns_1@172.23.106.187) - Control connection to memcached on 'ns_1@172.23.106.187' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_log:0:info:message(ns_1@172.23.106.187) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:28:35.860224 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1019) Stream closing, 0 items sent from disk, 0 items sent from memory, 5781 was last seqno sent
Thu Jul 24 12:28:35.860235 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1020) Stream closing, 0 items sent from disk, 0 items sent from memory, 5879 was last seqno sent
Thu Jul 24 12:28:35.860246 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1021) Stream closing, 0 items sent from disk, 0 items sent from memory, 5772 was last seqno sent
Thu Jul 24 12:28:35.860256 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1022) Stream closing, 0 items sent from disk, 0 items sent from memory, 5427 was last seqno sent
Thu Jul 24 12:28:35.860266 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1023) Stream closing, 0 items sent from disk, 0 items sent from memory, 5480 was last seqno sent

Status 137 is 128 (death by signal, set by the kernel) + 9, so signal 9 (SIGKILL). dmesg (captured in couchbase.log) has no signs of OOM. This means humans :) Not the first and sadly not the last time something like this happens. Rogue scripts, bad tests, etc.
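A tiny sketch of that arithmetic, for anyone decoding port server exit statuses in the future:

import signal

def decode_exit_status(status):
    # ns_server reports shell-style statuses: values above 128 mean the
    # process died from signal (status - 128).
    if status > 128:
        signum = status - 128
        names = dict((getattr(signal, n), n) for n in dir(signal)
                     if n.startswith("SIG") and not n.startswith("SIG_"))
        return "killed by signal %d (%s)" % (signum, names.get(signum, "?"))
    return "exited with code %d" % status

print(decode_exit_status(137))  # killed by signal 9 (SIGKILL)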
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Also, we should stop the practice of reusing tickets for unrelated conditions. This doesn't look anywhere close to a rebalance hang, does it?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Not sure what to do about this one. Closing as incomplete will probably not hurt.




[MB-11796] Rebalance after manual failover hangs (delta recovery) Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb11.log     Text File gdb12.log     Text File gdb13.log     Text File gdb14.log     Text File master_events.log    
Issue Links:
Duplicate
duplicates MB-11768 movement of 27 empty replica vbuckets... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.11.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.12.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.13.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.14.zip
Is this a Regression?: Yes

 Description   
1 of 4 nodes is being re-added after failover.
100M x 2KB items, 10K mixed ops/sec.

Steps:
1. Fail over one of the nodes.
2. Add it back.
3. Enable delta recovery.
4. Sleep 20 minutes.
5. Rebalance the cluster.

Warmup is completed but rebalance hangs afterwards.

 Comments   
Comment by Sriram Ganesan [ 23/Jul/14 ]
I see the following log messages

Tue Jul 22 23:16:44.367356 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.11->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds
Tue Jul 22 23:16:44.367363 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.14->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds
Tue Jul 22 23:16:44.367376 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.13->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds

I also see messages like this

Wed Jul 23 02:30:49.306705 PDT 3: 155 Closing connection due to read error: Connection reset by peer
Wed Jul 23 02:30:49.310060 PDT 3: 144 Closing connection due to read error: Connection reset by peer
Wed Jul 23 02:30:49.310273 PDT 3: 152 Closing connection due to read error: Connection reset by peer

The first set of messages could point to a bug in UPR that is causing the disconnections, and the second set could be because we are trying to read from a disconnected socket. Interestingly, a fix was merged recently for bug MB-11803 (http://review.couchbase.org/#/c/39760/) in the UPR noop area. It might be a good idea to run this test with that fix to see if it addresses the problem.

I don't see any of the above error messages in the logs of MB-11768. So, the seqnoWaitingStarted in this case could be different from the one in MB-11768 assuming that the fix for MB-11803 solves this problem.
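To make the disconnect messages above concrete: the consumer tracks the time of the last noop from the producer and drops the connection if it goes quiet for 40 seconds. A toy Python model of that check (names are illustrative, not ep-engine's actual implementation):

import time

NOOP_INTERVAL = 40  # seconds, matching the log messages above

class ConsumerConnection(object):
    def __init__(self):
        self.last_noop = time.time()

    def on_noop(self):
        # Producer heartbeat received; reset the liveness clock.
        self.last_noop = time.time()

    def should_disconnect(self, now=None):
        now = time.time() if now is None else now
        return now - self.last_noop > NOOP_INTERVAL

conn = ConsumerConnection()
print(conn.should_disconnect(conn.last_noop + 41))  # True: 41s of silence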

Comment by Pavel Paulau [ 24/Jul/14 ]
Indeed, that fix helped.




[MB-11795] Rebalance exited with reason {unexpected_exit, {'EXIT',<0.27836.0>,{bulk_set_vbucket_state_failed...} Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Meenakshi Goel Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1005-rel

Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.sc.couchbase.com/job/centos_x64--29_01--create_view_all-P1/126/consoleFull

Test to Reproduce:
./testrunner -i myfile.ini get-cbcollect-info=True,get-logs=True, -t view.createdeleteview.CreateDeleteViewTests.rebalance_in_and_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=3,num_views_per_ddoc=2,items=200000,sasl_buckets=1

Steps to Reproduce:
1. Setup a 4-node cluster
2. Create 1 default and 1 sasl bucket
3. Rebalance in 2 nodes
4. Start Rebalance

Logs:

[user:info,2014-07-23T2:25:43.220,ns_1@172.23.107.24:<0.1154.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {unexpected_exit,
                              {'EXIT',<0.27836.0>,
                               {bulk_set_vbucket_state_failed,
                                [{'ns_1@172.23.107.24',
                                  {'EXIT',
                                   {{{{{badmatch,
                                        [{<0.27848.0>,
                                          {done,exit,
                                           {normal,
                                            {gen_server,call,
                                             [<0.14598.0>,
                                              {setup_streams,
                                               [684,692,695,699,704,705,706,
                                                707,708,709,710,711,712,713,
                                                714,715,716,717,718,719,720,
                                                721,722,723,724,725,726,727,
                                                728,729,730,731,732,733,734,
                                                735,736,737,738,739,740,741,
                                                742,743,744,745,746,747,748,
                                                749,750,751,752,753,754,755,
                                                756,757,758,759,760,761,762,
                                                763,764,765,766,767,768,769,
                                                770,771,772,773,774,775,776,
                                                777,778,779,780,781,782,783,
                                                784,785,786,787,788,789,790,
                                                791,792,793,794,795,796,797,
                                                798,799,800,801,802,803,804,
                                                805,806,807,808,809,810,811,
                                                812,813,814,815,816,817,818,
                                                819,820,821,822,823,824,825,
                                                826,827,828,829,830,831,832,
                                                833,834,835,836,837,838,839,
                                                840,841,842,843,844,845,846,
                                                847,848,849,850,851,852,853]},
                                              infinity]}},
                                           [{gen_server,call,3,
                                             [{file,"gen_server.erl"},
                                              {line,188}]},
                                            {upr_replicator,
                                             '-spawn_and_wait/1-fun-0-',1,
                                             [{file,"src/upr_replicator.erl"},
                                              {line,195}]}]}}]},
                                       [{misc,
                                         sync_shutdown_many_i_am_trapping_exits,
                                         1,
                                         [{file,"src/misc.erl"},{line,1429}]},
                                        {upr_replicator,spawn_and_wait,1,
                                         [{file,"src/upr_replicator.erl"},
                                          {line,217}]},
                                        {upr_replicator,handle_call,3,
                                         [{file,"src/upr_replicator.erl"},
                                          {line,112}]},
                                        {gen_server,handle_msg,5,
                                         [{file,"gen_server.erl"},{line,585}]},
                                        {proc_lib,init_p_do_apply,3,
                                         [{file,"proc_lib.erl"},{line,239}]}]},
                                      {gen_server,call,
                                       ['upr_replicator-bucket0-ns_1@172.23.107.26',
                                        {setup_replication,
                                         [684,692,695,699,704,705,706,707,708,
                                          709,710,711,712,713,714,715,716,717,
                                          718,719,720,721,722,723,724,725,726,
                                          727,728,729,730,731,732,733,734,735,
                                          736,737,738,739,740,741,742,743,744,
                                          745,746,747,748,749,750,751,752,753,
                                          754,755,756,757,758,759,760,761,762,
                                          763,764,765,766,767,768,769,770,771,
                                          772,773,774,775,776,777,778,779,780,
                                          781,782,783,784,785,786,787,788,789,
                                          790,791,792,793,794,795,796,797,798,
                                          799,800,801,802,803,804,805,806,807,
                                          808,809,810,811,812,813,814,815,816,
                                          817,818,819,820,821,822,823,824,825,
                                          826,827,828,829,830,831,832,833,834,
                                          835,836,837,838,839,840,841,842,843,
                                          844,845,846,847,848,849,850,851,852,
                                          853]},
                                        infinity]}},
                                     {gen_server,call,
                                      ['replication_manager-bucket0',
                                       {change_vbucket_replication,684,
                                        'ns_1@172.23.107.26'},
                                       infinity]}},
                                    {gen_server,call,
                                     [{'janitor_agent-bucket0',
                                       'ns_1@172.23.107.24'},
                                      {if_rebalance,<0.1353.0>,
                                       {update_vbucket_state,684,replica,
                                        undefined,'ns_1@172.23.107.26'}},
                                      infinity]}}}}]}}}

Uploading Logs


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11795/f9ad56ee/172.23.107.24-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/07e24114/172.23.107.25-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/a9c9a36d/172.23.107.26-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/2517f70b/172.23.107.27-diag.zip
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
Seeing downstream (upr replicator to upr consumer) connection being closed.

Possibly due to this message

Wed Jul 23 02:25:43.080600 PDT 3: (bucket0) UPR (Consumer) eq_uprq:replication:ns_1@172.23.107.26->ns_1@172.23.107.24:bucket0 - (vb 684) Attempting to add stream with start seqno 0, end seqno 18446744073709551615, vbucket uuid 139895607874175, snap start seqno 0, and snap end seqno 0
Wed Jul 23 02:25:43.080642 PDT 3: (bucket0) UPR (Consumer) eq_uprq:replication:ns_1@172.23.107.26->ns_1@172.23.107.24:bucket0 - Disconnecting because noop message has no been received for 40 seconds
Wed Jul 23 02:25:43.082958 PDT 3: (bucket0) UPR (Producer) eq_uprq:replication:ns_1@172.23.107.24->ns_1@172.23.107.25:bucket0 - (vb 359) Stream closing, 0 items sent from disk, 0 items sent from memory, 0 was last seqno sent

This is on .24.

Appears related to yesterday's fix to detect disconnects on consumer side.
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
CC-ed Chiyoung and optimistically passed this to Mike due to apparent relation to fix made (AFAIK) by Mike.
Comment by Mike Wiederhold [ 23/Jul/14 ]
Duplicate of MB-18003
Comment by Ketaki Gangal [ 24/Jul/14 ]
MB-11803* ?
Comment by Chiyoung Seo [ 24/Jul/14 ]
Ketaki,

Yes, it's MB-11803.




[MB-11794] Creating 10 buckets causes memcached segmentation fault Created: 23/Jul/14  Updated: 26/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-998

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/396/artifact/
Is this a Regression?: Yes

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Sundar,

The backtrace indicates that it is mostly a regression from the vbucket-level lock change for flusher, vb snapshot, compaction, and vbucket deletion task, which we made recently.
Comment by Pavel Paulau [ 23/Jul/14 ]
The same issue happened with single bucket. The problem seems rather common.
Comment by Sundar Sridharan [ 24/Jul/14 ]
Found the root cause: cachedVBStates is not preallocated and is modified in a thread-unsafe manner. This regression shows up now because we have more parallelism with vbucket-level locking. Working on the fix.
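The class of bug described here is generic enough to sketch: a shared per-vbucket state map mutated by several threads needs a lock (or preallocation) to stay consistent. An illustrative Python model, not ep-engine's actual C++:

import threading

cached_vb_states = {}
state_lock = threading.Lock()

def set_vb_state(vb, state):
    # Without the lock, concurrent inserts into the shared map can race,
    # which is the thread-unsafe modification described above.
    with state_lock:
        cached_vb_states[vb] = state

threads = [threading.Thread(target=set_vb_state, args=(vb, "active"))
           for vb in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(cached_vb_states))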
Comment by Sundar Sridharan [ 24/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/#/c/39834/ thanks
Comment by Chiyoung Seo [ 24/Jul/14 ]
The fix was merged.




[MB-11793] Build breakage in upr-consumer.cc Created: 22/Jul/14  Updated: 23/Jul/14  Due: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: .master
Fix Version/s: .master
Security Level: Public

Type: Task Priority: Test Blocker
Reporter: Chris Hillery Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: ep-engine
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Commit 8d636bbb02b0338df9e73c2573422b6463feb92d to ep-engine appears to be breaking the build on most platforms, e.g.:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-master-builder/builds/890/steps/couchbase-server%20make%20enterprise%20/logs/stdio

 Comments   
Comment by Mike Wiederhold [ 23/Jul/14 ]
Just want to note here that this does not affect 3.0 builds, in case anyone is looking at the ticket. The merge of the memcached 3.0 branch is linked below. Since I don't think anyone is working on the master branch, I'm going to wait for someone to review the change.

http://review.couchbase.org/#/c/39708/




[MB-11792] Link in readme file in 3.0 does not work correctly Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Links still show the 2.5 docs in the browser:

http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-network-ports.html
http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-bestpractice.html
http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-redhat.html
 http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-ubuntu.html

 Comments   
Comment by Thuan Nguyen [ 22/Jul/14 ]
Tested on build 3.0.0-973
Comment by Amy Kurtzman [ 22/Jul/14 ]
Those links are not correct for the documentation. Also, the 3.0 beta documentation hasn't been published yet, so the links wouldn't work even if they were correct.

Docs are located at http://docs.couchbase.com/
(they are not at www.couchbase.com)




[MB-11791] README in ubuntu 12.04 image shows incorrect information Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Attachments: Text File README_Linux_3.0_beta.txt     Text File README_Mac_3.0_beta.txt     Text File README_Windows_3.0_beta.txt    
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
The README.txt in Couchbase Server version 3.0.0-973 for Ubuntu 12.04 shows incorrect information: Couchbase Server for Ubuntu 12.04 does not depend on libssl0.9.8 as shown in README.txt.

root@ubuntu:~# ls /tmp/
couchbase-server-enterprise_ubuntu_1204_x86_64_3.0.0-973-rel.deb ssh-ZnnQQUv795 vmware-root
root@ubuntu:~# more /opt/couchbase/README.txt
Couchbase Server 3.0.0, Ubuntu and Centos

Couchbase Server is a distributed NoSQL document database for interactive applications. Its scale-out architecture runs in the cloud or on commodity hardware and provides a flexible data model, consistent high-performance, easy scalability and always-on 24x365 availability. This release contains fixes as well as new features and functionality, including:

- Multiple Readers and Writers threads for more rapid persistence onto disk
-'Optimistic Replication' to improve latency when you replicate documents via XDCR
- More XDCR Statistics to monitor performance and behavior of XDCR
- Detailed Rebalance Report to show actual number of buckets and keys that have been transferred to other nodes in a cluster
- Transfer, Backup and Restore can be done for design documents only. You do not need to include data. The default behavior is to transfer both data and design documents.
- Hostname Management provided as easy to use interfaces in Web Console and Installation Wizard
- Command Line tools updated so you can manage nodes, buckets, clusters and XDCR
- Upload CSV files into Couchbase with cbtransfer

For more information, see our Release Notes: http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-server-rn.html

REQUIREMENTS

- For Ubuntu platforms you will need a OpenSSL dependency or your server will not run. Do the following:

    root-shell> apt-get install libssl0.9.8

    OpenSSL is already included with Centos

- To run cbcollect_info you must have administrative privileges

INSTALL

Centos: http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-redhat.html

Ubuntu: http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-ubuntu.html

By default we install Couchbase Server at /opt/couchbase

The server will automatically start after install and will be available by default on port 8091

For a full list of network ports for Couchbase Server, see http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-network-ports.html

To read more about Couchbase Server best practices, see http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-bestpractice.html



 Comments   
Comment by Ruth Harris [ 22/Jul/14 ]
Some of this content looks really really old. I'm attaching READMEs for all 3 operating systems.

Comment by Ruth Harris [ 22/Jul/14 ]
All 3 operating systems.




[MB-11790] couchbase-cli help does not show https in uploadHost in cluster-wide collectinfo (only https protocol supported) Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Tested build 3.0.0-1001

Start cluster-wide log collection for whole cluster
    couchbase-cli collect-logs-start -c 192.168.0.1:8091 \
        -u Administrator -p password \
        --all-nodes --upload --upload-host=host.upload.com \
        --customer="example inc" --ticket=12345

 




[MB-11789] couchbase-cli help should give an example of how to collect logs from only some nodes in cluster-wide collectinfo Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
couchbase-cli help does not show how to do cluster-wide collectinfo on only some nodes, as opposed to all nodes.

Start cluster-wide log collection for whole cluster
    couchbase-cli collect-logs-start -c 192.168.0.1:8091 \
        -u Administrator -p password \
        --all-nodes --upload --upload-host=host.upload.com \
        --customer="example inc" --ticket=12345

  Stop cluster-wide log collection
    couchbase-cli collect-logs-stop -c 192.168.0.1:8091 \
        -u Administrator -p password

  Show status of cluster-wide log collection
    couchbase-cli collect-logs-status -c 192.168.0.1:8091 \
        -u Administrator -p password




[MB-11788] [ui] getting incorrect ejection policy update warning when simply updating bucket's quota Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
SUBJ.

1. Create bucket aaa

2. Open bucket aaa Edit dialog

3. Change quota and hit enter

4. Observe a modal popup warning that should not be there; i.e., we're not updating any property that would require a bucket restart, but we're getting the warning.





[MB-11787] couchbase-cli should validate host before running cluster-wide collection Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install Couchbase 3.0.0-999 on one Ubuntu 12.04 node.
Do cluster-wide collectinfo with the upload option.

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --allnodes --upload --upload-host=http://abcnn.com --customer=1234 --ticket=

couchbase-cli collect-logs-start did not validate that the upload host is valid (https) before running collectinfo.

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-status -c 127.0.0.1:8091 -u Administrator -p password
Status: running
Details:
Node: ns_1@127.0.0.1
Status: started
path : /opt/couchbase/var/lib/couchbase/tmp/collectinfo-2014-07-22T213346-ns_1@127.0.0.1.zip






[MB-11786] {UPR}:: Rebalance-out hangs due to indexing stuck Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Seeing this issue in 991

1. Create 7 node cluster (10.6.2.144-150)
2. Create default Bucket
3. Add 1K items
4. Create 5 views and query
5. Rebalance out node 10.6.2.150

Step 4 and 5 are run in parallel

We see the rebalance hanging

I am seeing the following issue in the couchdb log on 10.6.2.150:

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}


Stacktrace: [{couch_index_merger,query_index,3,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_index_merger/src/couch_index_merger.erl"},
                  {line,75}]},
             {couch_httpd,handle_request,6,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couchdb/couch_httpd.erl"},
                  {line,222}]},
             {mochiweb_http,headers,5,


Will attach logs ASAP

Test Case:: ./testrunner -i ubuntu_x64--109_00--Rebalance-Out.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_with_queries,nodes_out=1,blob_generator=False,value_size=1024,GROUP=OUT;BASIC;P0;FROM_2_0

 Comments   
Comment by Parag Agarwal [ 22/Jul/14 ]
The cluster is live if you want to investigate 10.6.2.144-150.
Comment by Parag Agarwal [ 22/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11786/991_logs.tar.gz
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
We're waiting for the index to become updated.

I.e. I see a number of processes like this:

     {<17674.13818.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007f64917effa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f6493d4f070 Return addr 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.9.246202>">>,<<"y(1) infinity">>,
                   <<"y(2) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.11899.5>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007f6493d4f0a8 Return addr 0x00007f6444879940 (janitor_agent:wait_index_updated/5 + 432)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(2) {'janitor_agent-default','ns_1@10.6.2.144'}">>,
                   <<"y(3) Catch 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d0 Return addr 0x00007f6444a49ea8 (ns_single_vbucket_mover:'-wait_index_updated/5-fun-0-'/5 + 104)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d8 Return addr 0x00007f64917f38a0 (proc_lib:init_p/3 + 688)">>,
                   <<>>,
                   <<"0x00007f6493d4f0e0 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007f64917f38c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,2}]},
       {heap_size,610},
       {total_heap_size,1597},
       {links,[<17674.13242.5>]},
       {memory,13688},
       {message_queue_len,0},
       {reductions,806},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And this:
     {<0.13891.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f64448ad040 (capi_set_view_manager:'-do_wait_index_updated/4-lc$^0/1-0-'/3 + 64)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f643e3ac948 Return addr 0x00007f64448abb90 (capi_set_view_manager:do_wait_index_updated/4 + 848)">>,
                   <<"y(0) #Ref<0.0.9.246814>">>,
                   <<"y(1) #Ref<0.0.9.246821>">>,
                   <<"y(2) #Ref<0.0.9.246820>">>,<<"y(3) []">>,<<>>,
                   <<"0x00007f643e3ac970 Return addr 0x00007f64917f3ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) {<0.13890.5>,#Ref<0.0.9.246813>}">>,<<>>,
                   <<"0x00007f643e3ac980 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f64917f3ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,5}]},
       {heap_size,987},
       {total_heap_size,1974},
       {links,[]},
       {memory,16808},
       {message_queue_len,0},
       {reductions,1425},
       {trap_exit,false}]}
Comment by Parag Agarwal [ 22/Jul/14 ]
Still seeing the issue in 3.0.0-1000 on CentOS 6.x and Ubuntu 12.04.
Comment by Sriram Melkote [ 22/Jul/14 ]
Sarath, can you please take a look?
Comment by Nimish Gupta [ 22/Jul/14 ]
The error in the HTTP query will not hang the rebalance; the HTTP query error happens because the ddoc was updated.
I also see an error getting mutations for partition 127 from ep-engine:

[couchdb:info,2014-07-22T13:37:59.764,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.866,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.967,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.070,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.171,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.272,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.373,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.474,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.575,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.676,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.777,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.878,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.979,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...

The above message repeats continuously up to the point the logs were collected.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Yes, ep-engine kept returning ETMPFAIL for partition 127's stream request; hence, indexing never progressed.
The EP-Engine team should take a look.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Tue Jul 22 13:52:14.041453 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state
Tue Jul 22 13:52:14.143551 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state

It seems that vbucket 127 is stuck in backfill state and the backfill never completes.
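
For reference, the retry behavior in the log amounts to an unbounded poll; a minimal Python sketch of the observed pattern (the actual client is Erlang, and the names here are illustrative only):

import time

def request_stream(client, partition):
    # Retry while ep-engine returns 'temporary failure' (ETMPFAIL),
    # sleeping ~100 ms between attempts, as the upr client log shows.
    # With vbucket 127 stuck in backfill this loop never terminates,
    # so indexing, and therefore the rebalance, hangs.
    while True:
        status = client.stream_request(partition)  # illustrative API
        if status != 'ETMPFAIL':
            return status
        time.sleep(0.1)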
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/39896




[MB-11785] mcd aborted in bucket_engine_release_cookie: "es != ((void *)0)" Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 64 vb cluster_run -n1

Attachments: Zip Archive collectinfo-2014-07-22T192534-n_0@127.0.0.1.zip    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Observed while running pyupr unit tests against the latest from the rel-3.0.0 branch.

After about 20 tests, the crash occurred on test_failover_log_n_producers_n_vbuckets. This test passes standalone, so I think it's a matter of running all the tests in succession and then hitting this issue.

backtrace:

Thread 228 (Thread 0x7fed2e7fc700 (LWP 695)):
#0 0x00007fed8b608f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed8b60c388 in __GI_abort () at abort.c:89
#2 0x00007fed8b601e36 in __assert_fail_base (fmt=0x7fed8b753718 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fed8949f28c "es != ((void *)0)",
    file=file@entry=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=line@entry=3301,
    function=function@entry=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:92
#3 0x00007fed8b601ee2 in __GI___assert_fail (assertion=0x7fed8949f28c "es != ((void *)0)",
    file=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=3301,
    function=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:101
#4 0x00007fed8949d13d in bucket_engine_release_cookie (cookie=0x5b422e0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:3301
#5 0x00007fed8835343f in EventuallyPersistentEngine::releaseCookie (this=0x7fed4808f5d0, cookie=0x5b422e0)
    at /couchbase/ep-engine/src/ep_engine.cc:1883
#6 0x00007fed8838d730 in ConnHandler::releaseReference (this=0x7fed7c0544e0, force=false)
    at /couchbase/ep-engine/src/tapconnection.cc:306
#7 0x00007fed883a4de6 in UprConnMap::shutdownAllConnections (this=0x7fed4806e4e0)
    at /couchbase/ep-engine/src/tapconnmap.cc:1004
#8 0x00007fed88353e0a in EventuallyPersistentEngine::destroy (this=0x7fed4808f5d0, force=true)
    at /couchbase/ep-engine/src/ep_engine.cc:2034
#9 0x00007fed8834dc05 in EvpDestroy (handle=0x7fed4808f5d0, force=true) at /couchbase/ep-engine/src/ep_engine.cc:142
#10 0x00007fed89498a54 in engine_shutdown_thread (arg=0x7fed48080540)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1564
#11 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480a5b60) at /couchbase/platform/src/cb_pthreads.c:19
#12 0x00007fed8beba182 in start_thread (arg=0x7fed2e7fc700) at pthread_create.c:312
#13 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 226 (Thread 0x7fed71790700 (LWP 693)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093e80, mutex=0x7fed78093e48, ms=720)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78093e40, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78093e40, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78093e40, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78093e40, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480203e0) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71790700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 225 (Thread 0x7fed71f91700 (LWP 692)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093830, mutex=0x7fed780937f8, ms=86390052)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780937f0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780937f0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780937f0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780937f0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801d490) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71f91700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 224 (Thread 0x7fed72792700 (LWP 691)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3894)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801a670) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed72792700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 223 (Thread 0x7fed70f8f700 (LWP 690)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3893)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed48017850) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed70f8f700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 222 (Thread 0x7fed7078e700 (LWP 689)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1672)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b8e90) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed7078e700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 221 (Thread 0x7fed0effd700 (LWP 688)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1673)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b6890) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed0effd700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 210 (Thread 0x7fed0f7fe700 (LWP 661)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed740e8910)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed740667e0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 201 (Thread 0x7fed0ffff700 (LWP 644)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed74135070)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed74050ef0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 192 (Thread 0x7fed2cff9700 (LWP 627)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7c90)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c078340) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2cff9700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 183 (Thread 0x7fed2d7fa700 (LWP 610)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009e000)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5009dfe0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2d7fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 174 (Thread 0x7fed2dffb700 (LWP 593)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009dc30)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed50031010) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2dffb700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 165 (Thread 0x7fed2f7fe700 (LWP 576)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed481cef20)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480921c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 147 (Thread 0x7fed2effd700 (LWP 541)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed540015d0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54057b80) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2effd700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 138 (Thread 0x7fed6df89700 (LWP 523)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed78092aa0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78056ea0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6df89700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 120 (Thread 0x7fed2ffff700 (LWP 489)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7d10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c1b7ac0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 111 (Thread 0x7fed6cf87700 (LWP 472)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5008c030)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500adf50) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6cf87700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 102 (Thread 0x7fed6d788700 (LWP 455)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080450)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54091560) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6d788700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 93 (Thread 0x7fed6ff8d700 (LWP 438)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080ad0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54068db0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ff8d700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 57 (Thread 0x7fed6e78a700 (LWP 370)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50080230)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5008c360) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6e78a700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 48 (Thread 0x7fed6ef8b700 (LWP 352)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50000c10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500815b0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ef8b700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 39 (Thread 0x7fed6f78c700 (LWP 334)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed4807c290)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4806e4c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6f78c700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 13 (Thread 0x7fed817fa700 (LWP 292)):
#0 0x00007fed8b693d7d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b6c5334 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2 0x00007fed88386dd2 in updateStatsThread (arg=0x7fed780343f0) at /couchbase/ep-engine/src/memory_tracker.cc:36
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78034450) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed817fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 10 (Thread 0x7fed8aec4700 (LWP 116)):
#0 0x00007fed8b6be6bd in read () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b64d4e0 in _IO_new_file_underflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at fileops.c:613
#2 0x00007fed8b64e46e in __GI__IO_default_uflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at genops.c:435
#3 0x00007fed8b642184 in __GI__IO_getline_info (fp=0x7fed8b992640 <_IO_2_1_stdin_>, buf=0x7fed8aec3e40 "", n=79, delim=10,
    extract_delim=1, eof=0x0) at iogetline.c:69
#4 0x00007fed8b641106 in _IO_fgets (buf=0x7fed8aec3e40 "", n=0, fp=0x7fed8b992640 <_IO_2_1_stdin_>) at iofgets.c:56
#5 0x00007fed8aec5b24 in check_stdin_thread (arg=0x41c0ee <shutdown_server>)
    at /couchbase/memcached/extensions/daemon/stdin_check.c:38
#6 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a66250) at /couchbase/platform/src/cb_pthreads.c:19
#7 0x00007fed8beba182 in start_thread (arg=0x7fed8aec4700) at pthread_create.c:312
#8 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 9 (Thread 0x7fed89ea3700 (LWP 117)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed8a6c3280 <cond>, mutex=0x7fed8a6c3240 <mutex>, ms=19000)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8a4c0fea in logger_thead_main (arg=0x1a66fe0) at /couchbase/memcached/extensions/loggers/file_logger.c:372
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a67050) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed89ea3700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fed89494700 (LWP 135)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9cb0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd0f0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed89494700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7fed88c93700 (LWP 136)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9da0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd240) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed88c93700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fed83fff700 (LWP 137)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9e90) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd390) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed83fff700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fed837fe700 (LWP 138)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9f80) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd4e0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed837fe700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fed82ffd700 (LWP 139)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca070) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd630) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed82ffd700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fed827fc700 (LWP 140)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca160) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd780) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed827fc700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fed81ffb700 (LWP 141)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca250) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd8d0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed81ffb700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fed8d764780 (LWP 113)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041d24e in main (argc=3, argv=0x7fff77aaa838) at /couchbase/memcached/daemon/memcached.c:8797

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Abhinav,

The backtrace indicates that the abort crash was caused by closing all the UPR connections during shutdown, an area where we made some fixes recently.
Comment by Abhinav Dangeti [ 24/Jul/14 ]
Tommie, can you tell me how to run these tests, so I can try reproducing on my system?
Comment by Tommie McAfee [ 24/Jul/14 ]
* Start a cluster_run node, then:

git clone https://github.com/couchbaselabs/pyupr.git
cd pyupr
./pyupr -h 127.0.0.1:9000 -b dev


Note: all the tests may pass, but memcached can silently abort in the background.
Comment by Abhinav Dangeti [ 24/Jul/14 ]
1. Server side: if a UPR producer or UPR consumer already exists for that cookie, the engine should return DISCONNECT: http://review.couchbase.org/#/c/39843
2. pyupr: in the test test_failover_log_n_producers_n_vbuckets, you are essentially opening 1 connection and sending 1024 open-connection messages, so many tests will need changes; see the sketch below.
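
A hedged sketch of the test-side change this implies (UprClient and its methods are hypothetical stand-ins, not pyupr's actual API): give each producer its own connection instead of re-sending open-connection messages on a single one, which the server now answers with DISCONNECT:

def open_producers(host, n_producers):
    # One connection (and therefore one cookie) per producer; reusing a
    # cookie for a second open now causes the engine to disconnect it.
    clients = []
    for i in range(n_producers):
        client = UprClient(host)                # hypothetical client class
        client.open_producer('producer-%d' % i)
        clients.append(client)
    return clients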
Comment by Chiyoung Seo [ 24/Jul/14 ]
Tommie,

The server side fix was merged.

Can you please fix the issue in the test script and retest it?
Comment by Tommie McAfee [ 25/Jul/14 ]
Thanks, working now; the affected tests pass with this patch:

http://review.couchbase.org/#/c/39878/1




[MB-11784] GUI incorrectly displays vBucket number in stats Created: 22/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Ian McCloy Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: customer, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 251VbucketDisplay.png     PNG File 3fixVbucketDisplay.png    
Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Many customers are confused and have complained that on the "General Bucket Analytics" / "VBUCKET RESOURCES" page, when listing the number of vBuckets, the GUI tries to convert the value of 1024 default vBuckets to kilobytes, so it displays as 1.02k vBuckets (screenshot attached). vBucket counts shouldn't be abbreviated and should always show the full number.

I've changed the JavaScript to detect vBucket values and not abbreviate them (screenshot attached). Will amend with a Gerrit link when it's pushed to review.

 Comments   
Comment by Ian McCloy [ 22/Jul/14 ]
Code added to gerrit for review -> http://review.couchbase.org/#/c/39668/
Comment by Pavel Blagodov [ 24/Jul/14 ]
Hi Ian, here is a clarification:
- kilo (or 'K') is a unit prefix in the metric system denoting multiplication by one thousand.
- kilobyte (or 'KB') is a multiple of the unit byte for digital information.
Comment by Ian McCloy [ 24/Jul/14 ]
Pavel, thank you for clearing that up for me. Can you please explain: when I see 1.02K vBuckets in the stats, is that 1022, 1023 or 1024 active vBuckets? I'm not clear when I look at the UI.
Comment by Pavel Blagodov [ 25/Jul/14 ]
1.02K is the expected value because the UI currently truncates all analytics stats to three digits. Of course we could increase this to four digits, but that would only work for K (not for M, for example).
Comment by David Haikney [ 25/Jul/14 ]
@Pavel - Yes, 1.02K is currently expected, but the desire here is to change the UI to show "1024" instead of "1.02K": fewer characters and more accuracy.
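
To make the trade-off concrete, here is a small Python sketch of the truncation Pavel describes next to the count-aware behavior David asks for (illustrative only; the real UI code is JavaScript):

def format_stat(value, is_count=False):
    # Counts such as the number of vBuckets are shown in full; other stats
    # keep the three-significant-digit abbreviation, under which 1022, 1023
    # and 1024 all render as '1.02K'.
    if is_count or value < 1000:
        return str(value)
    return '%.3gK' % (value / 1000.0)

print(format_stat(1024))                 # '1.02K' (current behavior)
print(format_stat(1024, is_count=True))  # '1024'  (proposed)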




[MB-11783] We need the administrator creds available in isasl.pw Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Major
Reporter: Trond Norbye Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We'd like to add authentication for some of the operations (like setting configuration tunables dynamically). Instead of telling the user to go look on the system for isasl.pw, dig out the _admin entry, and then use that with the generated password, it would be nice if the credentials defined when setting up the cluster could be used.

 Comments   
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Interesting. Very much :)

I cannot do it because we don't have administrator creds anymore. We just have some kind of password hash and that's it.

I admit my fault; I could have been more forward-looking. But it was somewhat guided by your response back in the day, which I interpreted as reluctance to allow memcached auth via admin credentials.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And of course all admin ops to memcached can still be handled by ns_server safely and in a more controlled way (globally or locally, as needed).




[MB-11782] Adding Nodes To A Cluster Can Result In Reduced Active Residency Percentages Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Minor
Reporter: Morrie Schreibman Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Customer added 6 nodes to a large cluster in an attempt to increase the overall percentage of active bucket data in cache, and observed that the active bucket residency decreased after rebalancing. Decrease in active data residency after adding nodes and rebalancing turns out to be reproducible.

To reproduce this anomaly, create an 8-node cluster with a RAM quota of 100 MB per node and populate the default bucket until the active percentage in memory is about 40%. (I used cbworkloadgen and inserted 300K items into the default bucket, specifying an item size of 2K bytes and enabling the -j (JSON) option; a sample invocation is shown below.) Add 3 nodes to this cluster and rebalance. The resulting default active memory residency percentage will drop significantly and the replica residency percentage will increase. Note that if 3 random nodes are then removed and rebalanced, and then added back and rebalanced again, active residency will increase beyond the initial level.
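
The load described above corresponds to an invocation along these lines (hedged; check cbworkloadgen --help on your build for the exact flags, and substitute your own node and credentials):

/opt/couchbase/bin/cbworkloadgen -n localhost:8091 -u Administrator -p password -b default -i 300000 -s 2048 -j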

The critical factor in reproducing this anomaly is that the bucket data size must exceed its RAM quota such that the majority of bucket data resides on disk at any given time. When nodes are added to the cluster, the subsequent rebalance results in entire vbuckets being read from disk on one node and dumped into cache on the receiving node via the TAP protocol. Eventually the node's high-water mark is exceeded and ejections occur. What is consistently observable is that active ejections occur at a greater rate than replica ejections, which results in a decreased active bucket residency percentage and an increased replica bucket residency percentage.

Possible workarounds include adding/rebalancing nodes in stages, e.g., instead of adding 6 nodes to a cluster at once, add 3 nodes, rebalance, then add 3 more nodes and rebalance again. A second potential workaround would be to alter the default ejection probabilities for replica and active data, reducing the probability of ejecting active data and increasing the probability of ejecting replica data. I have not had time to test these possible workarounds.

After discussion in the Support group, our thinking is that any configuration change made with the intention of improving performance should not result in worsened performance, but that is what can happen in this case. Accordingly, we believe this is a bug and that the rebalancing algorithm should be examined to figure out why, under certain circumstances, rebalancing causes active data to be ejected with higher probability.

 Comments   
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Great job finding this out.

But I cannot just go ahead and improve it. And keep in mind that UPR might change things a lot, in both better and worse directions.

Eviction is something that I don't have any control over or much understanding of. I believe you'll need to ask Chiyoung's team to provide some instructions on what to do.

I can only add one guess. If it's related to multiple vbuckets being moved at the same time (which might be the case, but it's hard to say how much it contributes), then you can check that by lowering the rebalanceMovesBeforeCompaction internal setting; see the example below.
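
For anyone trying that experiment, internal settings are exposed over REST on port 8091; something like the following should lower the value (hedged: this appears to be the same endpoint the test harness uses elsewhere in these reports to update xdcrFailureRestartInterval, but verify the setting name and default on your build):

curl -u Administrator:password -d rebalanceMovesBeforeCompaction=4 http://localhost:8091/internalSettings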
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
With that said, passing this to the higher levels of the engineering suborganization.




[MB-11781] [Incremental offline xdcr upgrade] 2.0.1-170-rel - 3.0.0-973-rel, replica counts are not correct Created: 22/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Upgrade from 2.0.1-170 to 3.0.0-973

Ubuntu 12.04 LTS

Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/2f193298/10.3.3.218-7212014-740-diag.zip
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/86678a9d/10.3.3.218-diag.txt.gz
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11781/a272b793/10.3.3.218-7212014-734-couch.tar.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/46622b23/10.3.3.240-7212014-734-couch.tar.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/6da39af1/10.3.3.240-diag.txt.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11781/702bdaa2/10.3.3.240-7212014-738-diag.zip

[Destination]

10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/ae25e869/10.3.3.225-diag.txt.gz
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/f44d13a3/10.3.3.225-7212014-734-couch.tar.gz
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11781/f88f7912/10.3.3.225-7212014-743-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/98090b83/10.3.3.239-7212014-734-couch.tar.gz
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/ddb9b54c/10.3.3.239-7212014-741-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11781/e3ac7b07/10.3.3.239-diag.txt.gz
Is this a Regression?: Unknown

 Description   
[Jenkins]
http://qa.hq.northscale.net/job/ubuntu_x64--36_01--XDCR_upgrade-P1/24/consoleFull

[Test]
./testrunner -i ubuntu_x64--36_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.incremental_offline_upgrade,initial_version=2.0.1-170-rel,sdata=False,bucket_topology=default:1>2;bucket0:1><2,upgrade_seq=src><dest


[Test Steps]
1. Installed Source (2 nodes) and Destination (2 nodes) clusters with 2.0.1-170-rel.
2. Change XDCR global settings: xdcrFailureRestartInterval=1, xdcrCheckpointInterval=60 on both cluster.
3. Set up remote clusters (bidirectional).

bucket0 <--> bucket0 (Bi-directional) 10.3.3.240 <---> 10.3.3.239
default ---> default (Uni-directional) 10.3.3.240 -----> 10.3.3.239

4. Load 1000 items on each bucket on Source cluster.
5. Load 1000 items on bucket0 on destination cluster.
6. Wait for replication to finish.
7. Offline-upgrade each node one by one to 3.0.0-973 while loading 1000 items into bucket0 and default on the Source cluster.
8. Verify items on each side.

Expected items: bucket0 = 6000 and default = 5000.


[2014-07-21 09:46:45,612] - [task:463] INFO - Saw vb_active_curr_items 5000 == 5000 expected on '10.3.3.239:8091''10.3.3.225:8091',default bucket
[2014-07-21 09:46:45,628] - [data_helper:289] INFO - creating direct client 10.3.3.239:11210 default
[2014-07-21 09:46:45,732] - [data_helper:289] INFO - creating direct client 10.3.3.225:11210 default
[2014-07-21 09:46:45,811] - [task:463] INFO - Saw vb_replica_curr_items 5000 == 5000 expected on '10.3.3.239:8091''10.3.3.225:8091',default bucket
[2014-07-21 09:46:50,832] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:46:55,852] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:00,872] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:05,892] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:10,912] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:15,933] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:20,954] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:25,974] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:30,995] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:36,018] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:41,040] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:46,062] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:51,085] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:47:56,106] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:01,128] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:06,150] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket
[2014-07-21 09:48:11,173] - [task:459] WARNING - Not Ready: vb_replica_curr_items 5987 == 6000 expected on '10.3.3.239:8091''10.3.3.225:8091', bucket0 bucket





[MB-11780] [Upgrade tests] Before upgrade, after configuring XDCR on 2.0.0-1976, node become un-responsive Created: 22/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Version: 2.0.0-1976-rel
Ubuntu 12.04

Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11780/2b77fbd0/10.3.3.218-diag.txt.gz
10.3.3.218 : https://s3.amazonaws.com/bugdb/jira/MB-11780/f4673f14/10.3.3.218-7192014-753-diag.zip
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11780/066eb643/10.3.3.240-diag.txt.gz
10.3.3.240 : https://s3.amazonaws.com/bugdb/jira/MB-11780/d169dc68/10.3.3.240-7192014-753-diag.zip

[Destination]
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11780/481a8e90/10.3.3.225-7192014-754-diag.zip
10.3.3.225 : https://s3.amazonaws.com/bugdb/jira/MB-11780/566270c9/10.3.3.225-diag.txt.gz
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11780/52202c4f/10.3.3.239-7192014-754-diag.zip
10.3.3.239 : https://s3.amazonaws.com/bugdb/jira/MB-11780/7107d70c/10.3.3.239-diag.txt.gz

Issue occurred on 10.3.3.239, which was destination master.
Is this a Regression?: Unknown

 Description   
1. The test failed before the upgrade step.
2. Installed Source and Destination nodes with 2.0.0-1976-rel.
3. Changed global XDCR settings xdcrFailureRestartInterval=1 and xdcrCheckpointInterval=60 on each cluster.
4. Created remote clusters cluster0 and cluster1 for bi-directional XDCR.
5. Node 10.3.3.239 (the Destination master) became unresponsive.

[Notes]
1. Test worked fine on CentOS.
2. 3 tests failed because of this issue; all are related to the 2.0.0-1976-rel upgrade to 3.0.

[Jenkins]
http://qa.hq.northscale.net/job/ubuntu_x64--36_01--XDCR_upgrade-P1/24/consoleFull

[Test]
./testrunner -i ubuntu_x64--36_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-973-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.offline_cluster_upgrade,initial_version=2.0.0-1976-rel,sdata=False,bucket_topology=default:1>2;bucket0:1><2,upgrade_nodes=dest;src,use_encryption_after_upgrade=src;dest


[Failure]
[2014-07-19 07:50:28,182] - [basetestcase:264] INFO - sleep for 30 secs. ...
[2014-07-19 07:50:58,191] - [xdcrbasetests:1089] INFO - Setting xdcrFailureRestartInterval to 1 ..
[2014-07-19 07:50:58,204] - [rest_client:1726] INFO - Update internal setting xdcrFailureRestartInterval=1
[2014-07-19 07:50:58,262] - [rest_client:1726] INFO - Update internal setting xdcrFailureRestartInterval=1
[2014-07-19 07:50:58,263] - [xdcrbasetests:1089] INFO - Setting xdcrCheckpointInterval to 60 ..
[2014-07-19 07:50:58,278] - [rest_client:1726] INFO - Update internal setting xdcrCheckpointInterval=60
[2014-07-19 07:50:58,382] - [rest_client:1726] INFO - Update internal setting xdcrCheckpointInterval=60
[2014-07-19 07:50:58,392] - [rest_client:828] INFO - adding remote cluster hostname:10.3.3.239:8091 with username:password Administrator:password name:cluster1 to source node: 10.3.3.240:8091
[2014-07-19 07:50:58,780] - [rest_client:828] INFO - adding remote cluster hostname:10.3.3.240:8091 with username:password Administrator:password name:cluster0 to source node: 10.3.3.239:8091
[2014-07-19 07:50:59,048] - [rest_client:874] INFO - starting continuous replication type:capi from default to default in the remote cluster cluster1
[2014-07-19 07:50:59,250] - [basetestcase:264] INFO - sleep for 5 secs. ...
[2014-07-19 07:51:04,256] - [rest_client:874] INFO - starting continuous replication type:capi from bucket0 to bucket0 in the remote cluster cluster1
[2014-07-19 07:51:04,559] - [basetestcase:264] INFO - sleep for 5 secs. ...
[2014-07-19 07:51:15,538] - [rest_client:747] ERROR - http://10.3.3.239:8091/nodes/self error 500 reason: unknown ["Unexpected server error, request logged."]
http://10.3.3.239:8091/nodes/self with status False: [u'Unexpected server error, request logged.']
[2014-07-19 07:51:15,538] - [xdcrbasetests:139] ERROR - list indices must be integers, not str
[2014-07-19 07:51:15,539] - [xdcrbasetests:140] ERROR - Error while setting up clusters: (<type 'exceptions.TypeError'>, TypeError('list indices must be integers, not str',), <traceback object at 0x31dd7a0>)
[2014-07-19 07:51:15,540] - [xdcrbasetests:179] INFO - ============== XDCRbasetests cleanup is started for test #11 offline_cluster_upgrade ==============


 Comments   
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Logs from 10.3.3.239 at that duration:

[ns_server:debug,2014-07-19T7:50:40.998,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 0 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.137,ns_1@10.3.3.239:<0.995.0>:mc_connection:do_notify_vbucket_update:112]Signaled mc_couch_event: {set_vbucket,"bucket0",843,active,0}
[ns_server:debug,2014-07-19T7:50:41.142,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 1 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.143,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 2 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.139,ns_1@10.3.3.239:capi_set_view_manager-bucket0<0.979.0>:capi_set_view_manager:handle_info:377]Usable vbuckets:
[997,933,869,984,920,856,971,907,843,958,894,1022,945,881,1009,996,964,932,
 900,868,983,951,919,887,855,1015,970,938,906,874,1002,989,957,925,893,861,
 1021,976,944,912,880,848,1008,995,963,931,899,867,982,950,918,886,854,1014,
 969,937,905,873,1001,988,956,924,892,860,1020,975,943,911,879,847,1007,994,
 962,930,898,866,981,949,917,885,853,1013,968,936,904,872,1000,987,955,923,
 891,859,1019,974,942,910,878,846,1006,993,961,929,897,865,980,948,916,884,
 852,1012,999,967,935,903,871,986,954,922,890,858,1018,973,941,909,877,845,
 1005,992,960,928,896,864,979,947,915,883,851,1011,998,966,934,902,870,985,
 953,921,889,857,1017,972,940,908,876,844,1004,991,959,927,895,863,1023,978,
 946,914,882,850,1010,965,901,952,888,1016,939,875,1003,990,926,862,977,913,
 849]
[ns_server:debug,2014-07-19T7:50:41.150,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 3 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2014-07-19T7:50:41.158,ns_1@10.3.3.239:couch_stats_reader-bucket0<0.1006.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 4 ({not_found,no_db_file}). Ignoring
[views:debug,2014-07-19T7:50:41.246,ns_1@10.3.3.239:mc_couch_events<0.428.0>:capi_set_view_manager:handle_mc_couch_event:529]Got set_vbucket event for bucket0/842. Updated state: active (0)
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
Problem appeared on the Ubuntu platform only.




[MB-11779] Memory underflow in updates-only scenario with 5 buckets Created: 21/Jul/14  Updated: 26/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/view/lab/job/perf-dev/503/artifact/
Is this a Regression?: Yes

 Description   
Essentially re-opened MB-11661.

2 nodes, 5 buckets, 200K x 1KB docs per bucket (non-DGM), 2K updates per bucket.

Mon Jul 21 13:24:34.955935 PDT 3: (bucket-1) Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...

 Comments   
Comment by Sriram Ganesan [ 22/Jul/14 ]
Pavel

How often would you say this reproduces in your environment? I tried this locally a few times and didn't hit this.
Comment by Pavel Paulau [ 23/Jul/14 ]
Pretty much every time.

It usually takes >10 hours before the test encounters the GIGANTOR failure, but the slowly decreasing mem_used clearly indicates the issue.
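A sketch of the polling check described here, assuming the standard per-bucket stats endpoint; the host, bucket, interval and sample count are illustrative:

import base64
import json
import time
import urllib.request

def mem_used(host, bucket):
    req = urllib.request.Request(
        "http://%s:8091/pools/default/buckets/%s/stats" % (host, bucket))
    req.add_header("Authorization",
                   "Basic " + base64.b64encode(b"Administrator:password").decode())
    samples = json.load(urllib.request.urlopen(req))["op"]["samples"]
    return samples["mem_used"][-1]   # most recent sample

history = [mem_used("172.23.96.11", "bucket-1")]
for _ in range(59):                  # ~10 minutes at 10 s intervals
    time.sleep(10)
    history.append(mem_used("172.23.96.11", "bucket-1"))
# Under a constant update workload mem_used should be roughly flat; a
# non-increasing series over this window suggests the underflow.
if all(b <= a for a, b in zip(history, history[1:])):
    print("mem_used keeps decreasing -- possible memory underflow")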
Comment by Pavel Paulau [ 26/Jul/14 ]
Just spotted it again in a different scenario, build 3.0.0-1024. Proof: https://s3.amazonaws.com/bugdb/jira/MB-11779/172.23.96.11.zip .




[MB-11778] upr replica is unable to detect death of upr producer (was: Some replica items not deleted) Created: 21/Jul/14  Updated: 24/Jul/14  Resolved: 22/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centOS 6.x

Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://172.23.106.47:8091/index.html
http://172.23.106.45:8091/index.html

https://s3.amazonaws.com/bugdb/jira/MB-11573/logs.tar
Is this a Regression?: Unknown

 Description   
I'm seeing a bug similar to MB-11573 on build 991: 600 replica items haven't been deleted. However, curr_items and vb_active_curr_items are correct.


2014-07-21 18:18:44 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 2800 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.47:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.48:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 2800 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.47:11210 default
2014-07-21 18:18:45 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 172.23.106.48:11210 default
2014-07-21 18:18:45 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
2014-07-21 18:18:48 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', sasl_bucket_1 bucket
2014-07-21 18:18:49 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 3400 == 2800 expected on '172.23.106.47:8091''172.23.106.48:8091', standard_bucket_1 bucket

testcase:
./testrunner -i sanity.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,reboot=dest_node,items=2000,rdirection=bidirection,replication_type=xmem,standard_buckets=1,sasl_buckets=1,pause=source-destination,doc-ops=update-delete,doc-ops-dest=update-delete

What the test does:

3 nodes x 3 nodes, bi-directional XDCR on 3 buckets
1. Load 2k items on both clusters. Pause all XDCR (all items got replicated by this time).
2. Reboot one dest node (.48).
3. After warmup, resume replication on all buckets, on both clusters.
4. 30% update, 30% delete items on both sides. No expiration set.
5. Verify item count, value and rev-ids.


The cluster is available for debugging until tomorrow morning. Thanks.

 Comments   
Comment by Chiyoung Seo [ 21/Jul/14 ]
Mike,

Can you please look at this issue? The live cluster is available now.

Seems like the deletions are not replicated.
Comment by Mike Wiederhold [ 22/Jul/14 ]
The cluster looks fine right now, so the problem seems to have worked itself out. In the future, please run one of the scripts we have to figure out which vbuckets are mismatched in the cluster. This will greatly reduce the amount of time needed to look through the cbcollectinfo logs.
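A sketch of the kind of per-vbucket comparison meant here, using cbstats vbucket-details against the default bucket; the exact stat names and output layout are assumptions:

import collections
import re
import subprocess

def vb_stats(host):
    # Default bucket only; SASL buckets would need the -b/-p options.
    out = subprocess.check_output(
        ["/opt/couchbase/bin/cbstats", host + ":11210", "vbucket-details"],
        universal_newlines=True)
    state, items = {}, {}
    for line in out.splitlines():
        m = re.match(r"\s*vb_(\d+):num_items:\s*(\d+)", line)
        if m:
            items[int(m.group(1))] = int(m.group(2))
            continue
        m = re.match(r"\s*vb_(\d+):\s*(\w+)\s*$", line)   # e.g. "vb_0:  active"
        if m:
            state[int(m.group(1))] = m.group(2)
    return state, items

active, replica = collections.Counter(), collections.Counter()
for node in ("172.23.106.47", "172.23.106.48"):
    state, items = vb_stats(node)
    for vb, n in items.items():
        (active if state.get(vb) == "active" else replica)[vb] += n

for vb in sorted(set(active) | set(replica)):
    if active[vb] != replica[vb]:
        print("vb %d mismatched: active=%d replica=%d" % (vb, active[vb], replica[vb]))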
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
It was different issue, removed my comment.
Comment by Mike Wiederhold [ 22/Jul/14 ]
Alk,

In the memcached logs it looks like at the time this bug was reported there were missing items. Then, about 2 hours later, I see ns_server create a bunch of replication streams, and all of the items that were "missing" are no longer actually missing. Can you take a look at this from the ns_server side and see why it took so long to create the replication streams?

Also, note that as of right now there is only a live cluster and no cbcollectinfo on the ticket.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Which cluster do I need to look at, Mike?
Comment by Mike Wiederhold [ 22/Jul/14 ]
http://172.23.106.47:8091/index.html (This is the one that had the problem. Node .47 in particular)
http://172.23.106.45:8091/index.html
Comment by Aruna Piravi [ 22/Jul/14 ]
Please note the cbcollectinfo has already been attached under the link-to-logs section - https://s3.amazonaws.com/bugdb/jira/MB-11573/logs.tar - along with the cluster IPs.
Comment by Aruna Piravi [ 22/Jul/14 ]
And cbcollectinfo was grabbed at the time replica items were incorrect.

Just curious: only replica items were incorrect; active vb items on both clusters were correct. Does this still have to do with XDCR?
Comment by Aruna Piravi [ 22/Jul/14 ]
ok, I think Mike meant the intra-cluster replication streams.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
I'm not seeing the upr replicators spot this shutdown at all. I've also verified that when I kill -9 memcached, erlang's replicator correctly detects the connection closure and re-establishes connections.

Will now test with VMs and reboot.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Confirmed manually by doing a "hard reset" of a VM and observing that the other VM does not re-establish upr connections after the reset VM is rebooted.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
It is "it".
Comment by Mike Wiederhold [ 22/Jul/14 ]
http://review.couchbase.org/#/c/39683/
Comment by Aruna Piravi [ 24/Jul/14 ]
Verified on 1014. Closing this issue, thanks.




[MB-11777] couchbase-cli throws an error in collect-logs-start Created: 21/Jul/14  Updated: 22/Jul/14  Resolved: 22/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: unix

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-991 on one node.
Do a cluster-wide collect-info using the couchbase-cli command.
The ticket parameter is optional, but when the command is run, it reports that the ticket param is required:
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --allnodes --upload --upload-host=http://abcnn.com --customer=1234
ERROR: unable to start log collection: (400) Bad Request
{u'ticket': u'must contain only [0-9] and be no longer than 7 characters'}
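A minimal repro of the above, plus the obvious workaround of supplying an explicit numeric ticket; the --ticket flag name is assumed from the error text rather than confirmed from the CLI help:

import subprocess

cmd = ["/opt/couchbase/bin/couchbase-cli", "collect-logs-start",
       "-c", "127.0.0.1:8091", "-u", "Administrator", "-p", "password",
       "--allnodes", "--upload", "--upload-host=http://abcnn.com",
       "--customer=1234"]

subprocess.call(cmd)                         # reproduces the 400 above
subprocess.call(cmd + ["--ticket=1234567"])  # assumed workaround: pass a ticket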




 Comments   
Comment by Bin Cui [ 22/Jul/14 ]
http://review.couchbase.org/#/c/39677/




[MB-11776] UI shows rebalance to change mode from 2.0 to 3.0 after all nodes in cluster converted to 3.0 Created: 21/Jul/14  Updated: 21/Jul/14  Resolved: 21/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Aleksey Kondratenko
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit and centos 6.x 64-bit

Attachments: Zip Archive 192.168.171.148-7212014-1129-diag.zip     Zip Archive 192.168.171.149-7212014-1130-diag.zip     Zip Archive 192.168.171.150-7212014-1131-diag.zip     Zip Archive 192.168.171.151-7212014-1132-diag.zip     PNG File ss_2014-07-21_at_11.28.20 AM.png    
Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Manifest file for build 3.0.0-990 http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-990-rel.deb.manifest.xml
Is this a Regression?: Unknown

 Description   
Install couchbase server 2.0.0 on 2 nodes 148 and 149
Create cluster of 2 nodes
Create 3 buckets: default, sasl and standard bucket.
No doc or view created
Load 1K items to each bucket
Install couchbase server 3.0.0-990 in to node 150 and 151
Add node 150 and 151 to 2.0 cluster.
3.0 node takes over master. Passed
Rebalance in 2 3.0 nodes into cluster. Passed
Remove the two 2.0 nodes from the cluster. Failed even though the log says the rebalance was successful.
One 3.0 node was kicked out of the cluster and became a pending node at the end of the rebalance. Filed bug MB-11774 as I could reproduce it on a centos cluster.
Rebalance again to rebalance the pending node. Passed
When the rebalance is done and all nodes in the cluster are on 3.0, the UI shows the cluster doing one more rebalance to change mode from 2.0 to 3.0 (see the REST sketch below).
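The add/remove steps above map onto standard ns_server REST endpoints. A hedged sketch of the same swap, assuming /controller/addNode and /controller/rebalance and the usual ns_1@<ip> otpNode naming:

import base64
import urllib.parse
import urllib.request

def post(url, params):
    req = urllib.request.Request(url, data=urllib.parse.urlencode(params).encode())
    req.add_header("Authorization",
                   "Basic " + base64.b64encode(b"Administrator:password").decode())
    return urllib.request.urlopen(req).read()

master = "http://192.168.171.148:8091"

# Add the two 3.0 nodes, then rebalance the two 2.0 nodes out in one pass.
for new_node in ("192.168.171.150", "192.168.171.151"):
    post(master + "/controller/addNode",
         {"hostname": new_node, "user": "Administrator", "password": "password"})

known = ",".join("ns_1@192.168.171.%d" % n for n in (148, 149, 150, 151))
ejected = "ns_1@192.168.171.148,ns_1@192.168.171.149"
post(master + "/controller/rebalance",
     {"knownNodes": known, "ejectedNodes": ejected})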



 Comments   
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
This is expected behavior of the upgrade to upr. We do one final rebalance-like activity to upgrade replication from tap to upr, and the UI displays it as a rebalance.




[MB-11775] Rebalance-stop is slow -- takes multiple attempts to stop rebalance Created: 21/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Ketaki Gangal Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel
Centos 6.4

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Setup:

1. Cluster 7 nodes, 2 buckets, 1 design doc X 2 views
2. Load 120M, 99M items on both the buckets, dgm state of 70% active resident.
3. Do a graceful failover on 1 node
4. Choose delta recovery, add back node and rebalance

5. I tried to stop the rebalance a number of times (about 10) --- unsuccessful on most attempts.
The rebalance eventually failed with reason "Rebalance exited with reason stop" --- rebalance stop is not working as expected.

- Attaching logs



 Comments   
Comment by Ketaki Gangal [ 21/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/MB-11775/11775.tar
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
I'll need more diagnostics to figure out this case. I've added some in this commit (still pending review and merge): http://review.couchbase.org/39625

Please retest after this commit is merged so that I can see what makes rebalance stop slow.
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
referenced commit is now in. So test as soon as you get next build
Comment by Ketaki Gangal [ 22/Jul/14 ]
Tested with build which contains the above commit - build 3.0.0-999-rel.

Seeing the same behaviour, where it takes a couple of attempts to stop rebalance.

Logs at https://s3.amazonaws.com/bugdb/MB-11775/11775-2.tar
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Uploaded probable fix:

http://review.couchbase.org/39694

With this fix (assuming I am right about the cause of the slowness) we'll be able to stop even if some node is stuck somewhere in janitor_agent, which could in turn be due to the view engine. That would mean the original slowness would (maybe) be visible elsewhere, possibly in a harder-to-debug way.

So in order to diagnose _that_ I need you to capture a diag or collectinfo from just one node _immediately_ after you send the stop and it is slow. If this is done correctly I'll be able to see what is causing that slowness in the first place. Note that it needs to be done on a build prior to the rebalance-stop fix referred to above.
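To make the timing concrete, a sketch of "send stop, then immediately grab diag" over REST, assuming the standard /controller/stopRebalance and /diag endpoints; the node address and credentials are placeholders:

import base64
import urllib.request

def call(url, data=None):
    req = urllib.request.Request(url, data=data)
    req.add_header("Authorization",
                   "Basic " + base64.b64encode(b"Administrator:password").decode())
    return urllib.request.urlopen(req)

node = "http://10.1.1.1:8091"                       # any one node
call(node + "/controller/stopRebalance", data=b"")  # send the stop (POST)
with open("node-diag.txt", "wb") as out:            # capture while the stop is in flight
    out.write(call(node + "/diag").read())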
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
merged. So rebalance stop should not be slow anymore. But see above for some additional investigation that we should do.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
reverted for now
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
Merged hopefully more correct fix: http://review.couchbase.org/39756




[MB-11774] 3.0 node put in pending mode after online upgrade from 2.0 to 3.0.0-990 Created: 21/Jul/14  Updated: 21/Jul/14  Resolved: 21/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64bit and centos 6.x 64-bit

Attachments: Zip Archive 10.3.121.241-7212014-1252-diag.zip     Zip Archive 10.3.3.224-7212014-1251-diag.zip     Zip Archive 10.6.2.112-7212014-1250-diag.zip     Zip Archive 10.6.2.113-7212014-1250-diag.zip     Zip Archive 192.168.171.148-7212014-1129-diag.zip     Zip Archive 192.168.171.149-7212014-1130-diag.zip     Zip Archive 192.168.171.150-7212014-1131-diag.zip     Zip Archive 192.168.171.151-7212014-1132-diag.zip     PNG File ss_2014-07-21_at_11.27.41 AM.png    
Triage: Untriaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Manifest file for build 3.0.0-990 http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-990-rel.deb.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 2.0.0 on 2 nodes 148 and 149
Create cluster of 2 nodes
Create 3 buckets: default, sasl and standard bucket.
No doc or view created
Load 1K items to each bucket
Install couchbase server 3.0.0-990 in to node 150 and 151
Add node 150 and 151 to 2.0 cluster.
3.0 node takes over master. Passed
Rebalance in 2 3.0 nodes into cluster. Passed
Remove the two 2.0 nodes from the cluster. Failed even though the log says the rebalance was successful.
One 3.0 node was kicked out of the cluster and became a pending node at the end of the rebalance.


 Comments   
Comment by Thuan Nguyen [ 21/Jul/14 ]
This bug did not show in build 3.0.0-973
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
I'm not seeing any signs of trouble in these logs, and it's also unclear what you're trying to show in this screenshot.
Comment by Thuan Nguyen [ 21/Jul/14 ]
Re-ran the same test with external VMs to debug. Got the same error.
1:10.6.2.112
2:10.6.2.113

3:10.3.3.224
4:10.3.121.241

Install couchbase server 2.0.0 on 2 nodes 112 and 113
Create cluster of 2 nodes
Create 3 buckets: default, sasl and standard bucket.
No doc or view created
Load 1K items to each bucket
Install couchbase server 3.0.0-990 in to node 224 and 241
Add node 224 and 241 to 2.0 cluster.
3.0 node takes over master. Passed
Rebalance the two 3.0 nodes into the cluster. The log shows success but it's not true: node 224 suddenly goes into pending state during the rebalance.

Live cluster is available. I will upload the collect-info soon.
Comment by Thuan Nguyen [ 21/Jul/14 ]
Collect info file from centos nodes
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
We introduced it as part of this commit:

commit 6633b0d0dbb08656b63b7202924680279a8cde8e
Author: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
Date: Wed Jul 16 16:18:20 2014 -0700

    MB-11622 Don't accept changes to per node keys in mixed clusters.
    
    Change-Id: Ia5bcd9bd795b201bce90829f02c31aecdb2df004
    Reviewed-on: http://review.couchbase.org/39464
    Tested-by: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
    Reviewed-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>


We'll fix asap. Thanks for spotting it.
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
http://review.couchbase.org/39619
http://review.couchbase.org/39620
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
merged




[MB-11773] 8091 and 8092 time out after 5 minutes Created: 21/Jul/14  Updated: 21/Jul/14  Resolved: 21/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Michael Nitschinger Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Hey folks,

I'm not sure if this is a bug, expected or something else, but I came across it and I'd like to get clarification.
During my 2.0 development I noticed that netty was throwing ChannelInactive events here and there. I looked a bit closer and found the following pattern:

- I didn't run any load
- It is happening _exactly_ every 5 mins
- It only happens on 8091 and 8092 (not on 11210).

Is there a setting on the server that closes them after 5 min of inactivity? I can certainly deal with that on the client side; I'd just like to get clarification.

 Comments   
Comment by Mark Nunberg [ 21/Jul/14 ]
Note that this is not a client-side issue, and is reproducible by simply doing a 'nc host:8091' and waiting five minutes
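A plain-socket version of the same repro that also times the idle close; the host name is a placeholder:

import socket
import time

s = socket.create_connection(("host", 8091))   # replace "host" with a node address
start = time.time()
s.recv(1)   # send nothing; blocks until the server closes the idle connection
print("server closed idle connection after %.0f seconds" % (time.time() - start))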
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
5 minutes is appropriate for keep-alive sockets, and more than appropriate for fresh sockets. For active sockets I'm not aware of any issues.
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
It's the default behavior of mochiweb (our web server stack for erlang) to do this. But it's also how most, if not all, web servers behave.

While we could do something to prevent mochiweb from closing sockets quickly, that would still not work for some cases of HTTP proxy use, and it could complicate our lives considerably in the future. I think it's OK (and in fact a must) for a web server to do what mochiweb is doing here.
Comment by Mark Nunberg [ 21/Jul/14 ]
Can mochi web send out a Keep-Alive: timeout=300 header, then? This is fairly conventional
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
It's possible. But AFAIK it's not required by the spec, and even after sending keep-alive, the server still has the right to terminate an idle connection sooner.
Comment by Aleksey Kondratenko [ 21/Jul/14 ]
Looking at rfc2616, it looks like keep-alive headers are a 1.0 artifact and are not required at all for HTTP 1.1. So relying on them in our clients doesn't seem like a good choice.




[MB-11772] Provide the facility to release free memory back to the OS from running mcd process Created: 21/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1, 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Dave Rigby Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
On many occasions we have seen tcmalloc being very "greedy" with free memory and not releasing it back to the OS very quickly. There have even been occasions where this has triggered the Linux OOM-killer due to the memcached process having too much "free" tcmalloc memory still resident.

tcmalloc by design will /slowly/ return memory back to the OS - via madvise(DONT_NEED) - but this rate is very conservative, and currently it can only be changed by modifying an environment variable, which obviously cannot be done on a running process.

To help mitigate these problems in future, it would be very helpful to allow the user to request that free memory is released back to the OS.
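For context, the only pre-fix knob implied above is setting the environment variable before the process starts. A sketch, assuming the variable is gperftools' TCMALLOC_RELEASE_RATE and with an illustrative (not the real) memcached command line:

import os
import subprocess

# Higher TCMALLOC_RELEASE_RATE means free pages are returned to the OS
# (via madvise) more eagerly; the default is a conservative ~1.0.
env = dict(os.environ, TCMALLOC_RELEASE_RATE="10")
subprocess.Popen(["/opt/couchbase/bin/memcached",
                  "-C", "/opt/couchbase/var/lib/couchbase/config/memcached.json"],
                 env=env)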


 Comments   
Comment by Dave Rigby [ 21/Jul/14 ]
http://review.couchbase.org/#/c/39608/




[MB-11771] Doc : mismatch in doc about swap configuration Created: 21/Jul/14  Updated: 21/Jul/14  Resolved: 21/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Trivial
Reporter: mdespriee Assignee: Ruth Harris
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
In this page (http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#couchbase-bestpractice), it is stated that :
"Swap configuration
Swap should be configured on the Couchbase Server. This prevents the operating system from killing Couchbase Server should the system RAM be exhausted. Having swap provides more options on how to manage such a situation."

And later on (http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#swap-space), it is advised to disable it (swappiness to 0).

From what I read here and there in blogs and KB articles, I understand that the latter is preferred, so I suggest updating the doc (http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#couchbase-bestpractice).

 Comments   
Comment by Dave Rigby [ 21/Jul/14 ]
Note that `swappiness==0` doesn't actually disable swap; it just sets how likely the OS is to swap, and setting it to zero essentially means it shouldn't swap unless there is (close to) zero free physical RAM.

While it may appear that these two recommendations are contradictory, they are not - the intent to the OS is:

* "don't swap unless you really have to" (for maximum performance),
* "but if you need to, here is some swap space to make use of" (so in the event of low memory Couchbase process will not potentially be killed by Linux's OOM-killer, but will continue to run (but slower from swap).
Comment by Ian McCloy [ 21/Jul/14 ]
I agree it's not clear, but we shouldn't document not configuring any swap. We recommend that swap be available for when memory is completely exhausted, but it should be set to the lowest priority so that it isn't casually allocated by the OS. Swap is also used by NUMA to move memory pages from one CPU to another.
Comment by mdespriee [ 21/Jul/14 ]
Understood. Thanks for the clarification




[MB-11770] Re-investigate impact of xdcrMaxConcurrentReps on XDCR latency (LAN/WAN) and choose safer value for 3.0 Created: 21/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: performance
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified





[MB-11769] Major regression in write performance (xdcr, 2 buckets) Created: 21/Jul/14  Updated: 21/Jul/14  Resolved: 21/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Duplicate Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File disk_write_queue.png    
Issue Links:
Duplicate
duplicates MB-11731 Persistence to disk suffers from buck... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/381/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Disk write queue, source, 2.5.1-1083 vs 3.0.0-988:
2-3K vs. 120-150K

Full report: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_251-1083_3a7_access&snapshot=atlas_c1_300-988_7a5_access

Disk write queue, destination, 2.5.1-1083 vs 3.0.0-988:
5-6K vs. 60-80K

Full report: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_251-1083_c5b_access&snapshot=atlas_c2_300-988_abb_access


 Comments   
Comment by Pavel Paulau [ 21/Jul/14 ]
It's basically the same problem as MB-11731.

The disk write queue increases when compaction starts, but the difference (compared to 2.5.1) is greater now.




[MB-11768] movement of 27 empty replica vbuckets gets stuck in seqnoWaiting Created: 20/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Sriram Ganesan
Resolution: Cannot Reproduce Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-988

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Issue Links:
Duplicate
is duplicated by MB-11796 Rebalance after manual failover hangs... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.11.zip
https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.12.zip
https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.13.zip
https://s3.amazonaws.com/bugdb/jira/MB-11768/172.23.96.14.zip
Is this a Regression?: Yes

 Description   
Rebalance of 10 empty buckets never finishes.

Rebalance of first bucket ("bucket-10") completed successfully.

Movement of 27 vbuckets in bucket "bucket-9" started but obviously makes no progress.

451, 455, 459, 463, 467, 471, 475, 479, 483, 487, 491, 495, 499, 503, 507, 511, 811, 815, 819, 823, 827, 831, 835, 839, 843, 847, 851

According to master events the last step in all cases is seqnoWaitingStarted:

{u'node': u'ns_1@172.23.96.13', u'bucket': u'bucket-9', u'pid': u'<0.19962.2>', u'ts': 1405841984.068374, u'chainBefore': [u'172.23.96.13:11209', u'172.23.96.11:11209'], u'chainAfter': [u'172.23.96.14:11209', u'172.23.96.11:11209'], u'vbucket': 851, u'type': u'vbucketMoveStart'}
{u'bucket': u'bucket-9', u'state': u'replica', u'ts': 1405841984.069871, u'host': u'172.23.96.11:11209', u'vbucket': 851, u'type': u'vbucketStateChange'}
{u'bucket': u'bucket-9', u'state': u'replica', u'ts': 1405841984.070029, u'host': u'172.23.96.14:11209', u'vbucket': 851, u'type': u'vbucketStateChange'}
{u'node': u'172.23.96.14:11209', u'bucket': u'bucket-9', u'vbucket': 851, u'type': u'indexingInitiated', u'ts': 1405841984.073599}
{u'bucket': u'bucket-9', u'vbucket': 851, u'type': u'backfillPhaseEnded', u'ts': 1405841984.074577}
{u'node': u'172.23.96.14:11209', u'seqno': 0, u'bucket': u'bucket-9', u'ts': 1405841984.075081, u'vbucket': 851, u'type': u'seqnoWaitingStarted'}
{u'node': u'172.23.96.11:11209', u'seqno': 0, u'bucket': u'bucket-9', u'ts': 1405841984.075081, u'vbucket': 851, u'type': u'seqnoWaitingStarted'}

Not surprisingly, 1051 replica vbuckets are reported.
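A sketch of the scan behind the vbucket list above: track the most recent event per vbucket and report those whose last step is seqnoWaitingStarted. The file name and per-line dict format are assumed from the excerpt; real master_events logs are JSON, where json.loads would replace ast.literal_eval:

import ast

last_event = {}
for line in open("master_events.log"):
    ev = ast.literal_eval(line.strip())   # events above are printed as Python dicts
    if ev.get("bucket") == "bucket-9" and "vbucket" in ev:
        last_event[ev["vbucket"]] = ev["type"]

stuck = sorted(vb for vb, t in last_event.items() if t == "seqnoWaitingStarted")
print("vbuckets stuck after seqnoWaitingStarted:", stuck)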

 Comments   
Comment by Sriram Ganesan [ 21/Jul/14 ]
Pavel

I tried running the rebalance-in with cluster_run and 10 empty buckets and haven't been able to reproduce this with the latest repo. Can you please try with the latest build and update the ticket to note whether this is still a problem?
Comment by Pavel Paulau [ 21/Jul/14 ]
Sriram,

The issue is very occasional, probably only one out of 50 tests fails.

Also build 3.0.0-988 is not that old.

If you cannot find anything in the logs, then please close as incomplete.
Comment by Pavel Paulau [ 23/Jul/14 ]
See also MB-11796. It's the same issue and may have more details for you.

Will a live cluster help?
Comment by Sriram Ganesan [ 23/Jul/14 ]
Sure. I shall take a look at the live cluster.
Comment by Pavel Paulau [ 23/Jul/14 ]
Live cluster with MB-11796 (which I hope is a duplicate): 172.23.96.11:8091 Administrator:password
Comment by Pavel Paulau [ 24/Jul/14 ]
I'm closing the ticket for now.

I will provide a live cluster if it happens again.




Generated at Sat Jul 26 02:58:12 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.