[MB-11838] Rename upr to dcp in couchdb Created: 29/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Nimish Gupta Assignee: Nimish Gupta
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Nimish Gupta [ 29/Jul/14 ]
http://review.couchbase.org/#/c/39865/




[MB-11837] need to enhance log in ns_server as in bug MB-11836 Created: 28/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: feature-backlog, 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Thuan Nguyen Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
I created bug MB-11836 and attached all logs from all nodes in the cluster, but no log captures the issue in that bug. I need to reproduce it on a live cluster to debug it. In real life, it is sometimes hard to get a live cluster from customer production to debug the issue.




[MB-11836] rebalance button greyed out after adding a node to a node Created: 28/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Thuan Nguyen
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Attachments: Zip Archive 192.168.171.148-7282014-187-diag.zip     Zip Archive 192.168.171.149-7282014-188-diag.zip     PNG File ss_2014-07-28_at_6.06.52 PM.png    
Issue Links:
Relates to
relates to MB-11616 Rebalance not available 'pending add ... In Progress
Triage: Triaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1040 on 2 Ubuntu 12.04 64-bit nodes.
At node 148, add node 149 to node 148.
Node 149 is added to node 148, but the rebalance button is greyed out and not clickable.

 Comments   
Comment by Aleksey Kondratenko [ 28/Jul/14 ]
Logs are not enough here. And I was unable to reproduce it.

Please do:

a) leave broken system running for me

b) try force-reloading a page (Ctrl-F5)
Comment by Thuan Nguyen [ 28/Jul/14 ]
I could repro this on 2 other Ubuntu VMs:
10.6.2.73
10.6.2.137

Node 137 was added to node 73
Comment by Thuan Nguyen [ 28/Jul/14 ]
Live cluster is available to debug
Comment by Aliaksey Artamonau [ 28/Jul/14 ]
This happens only when there are no buckets. I reverted the change that broke it: http://review.couchbase.org/#/c/39983/. Assigning to Pavel for him to work on the fix.
Comment by Aleksey Kondratenko [ 28/Jul/14 ]
Let's work on the fix as part of MB-11616, which introduced this bug. Because the bug is now fixed, let's just close this one.
Comment by Parag Agarwal [ 28/Jul/14 ]
Alk, please check 10.6.2.144, this is build 1044
Comment by Aliaksey Artamonau [ 28/Jul/14 ]
Build 1044 doesn't have the required commit.
Comment by Parag Agarwal [ 28/Jul/14 ]
ok, will check for the build with the fix





[MB-11835] Stack-corruption crash opening db, if path len > 250 bytes Created: 28/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Jens Alfke Assignee: Jung-Sang Ahn
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
If the path to a database file is more than 250 bytes long, and auto-compaction is enabled, the stack will be corrupted, probably causing a crash. The reason is that compactor_is_valid_mode() copies the path into a stack buffer 256 bytes long and then appends a 5-byte suffix to it.

The buffer needs to be bigger. I suggest using MAXPATHLEN as the size, at least on Unix systems; it's a common Unix constant defined in <sys/param.h>. On Apple platforms the value is 1024.
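A minimal sketch of the suggested fix, with hypothetical identifiers and an assumed 5-byte suffix (this is not the actual forestdb source): size the scratch buffer with MAXPATHLEN and reject paths that still would not fit the appended suffix, rather than overflowing the stack.

#include <sys/param.h>   // MAXPATHLEN (1024 on Apple platforms)
#include <cstdio>
#include <cstring>

static const char kSuffix[] = ".meta";      // hypothetical 5-byte suffix

bool is_valid_mode_sketch(const char *filename) {
    char path[MAXPATHLEN];                  // was effectively char path[256]
    // Fail cleanly if the path plus suffix cannot fit in the buffer.
    if (std::strlen(filename) + sizeof(kSuffix) > sizeof(path)) {
        return false;
    }
    std::snprintf(path, sizeof(path), "%s%s", filename, kSuffix);
    // ... inspect 'path' here as the real function does ...
    return true;
}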

Backtrace of the crash in the iOS simulator looks like this; apparently __assert_rtn is an OS stack sanity check.

* thread #1: tid = 0x6112d, 0x0269969e libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0269969e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x0265e2c1 libsystem_pthread.dylib`pthread_kill + 101
    frame #2: 0x023a59c9 libsystem_sim_c.dylib`abort + 127
    frame #3: 0x0237053b libsystem_sim_c.dylib`__assert_rtn + 284
  * frame #4: 0x000b3644 HeadlessBee`compactor_is_valid_mode(filename=<unavailable>, config=<unavailable>) + 276 at compactor.cc:774
    frame #5: 0x000bafd9 HeadlessBee`_fdb_open(handle=<unavailable>, filename=<unavailable>, config=0xbfff8ee0) + 201 at forestdb.cc:842
    frame #6: 0x000baede HeadlessBee`fdb_open(ptr_handle=<unavailable>, filename=<unavailable>, fconfig=0xbfff9010) + 158 at forestdb.cc:528

The actual path causing the crash (251 bytes long) was:

/Volumes/Retina2/Users/snej/Library/Developer/CoreSimulator/Devices/F889372A-F7E8-4534-B6B3-C3E23EFE528C/data/Applications/988D316C-31F3-4A05-8EDC-79C86061C7C9/Library/Application Support/CouchbaseLite/test13_itunesindex-db.cblite2/x:artists.viewindex




[MB-11834] couchbase-cli cluster-wide collectinfo does not work in single node with real IP Created: 28/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: No

 Description   
Install couchbase server 3.0.0-1035 on a single Ubuntu 12.04 64-bit node (IP 192.168.171.148).
Run cluster-wide collectinfo on this node.
It fails to start log collection with the real IP:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.148
NODES: ns_1@192.168.171.148
ERROR: unable to start log collection: (400) Bad Request
{u'nodes': u'Unknown nodes: ["ns_1@192.168.171.148"]'}

When run with IP 127.0.0.1, it shows neither an error nor the status of the executed command.
 
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@127.0.0.1
NODES: ns_1@127.0.0.1

 Comments   
Comment by Aleksey Kondratenko [ 28/Jul/14 ]
Looks like a bug in the CLI tools.

It is not using the API correctly.

It cannot just build the node name as ns_1@<ip> and expect it to work. This is exactly the same as for failover and the other APIs that use node names.

You're supposed to take the otpNode name from the cluster's pool details. My understanding is that the CLI does this at least for failover.
Comment by Bin Cui [ 28/Jul/14 ]
http://review.couchbase.org/#/c/39980/
Comment by Bin Cui [ 28/Jul/14 ]
For a single node, the actual node name will be 127.0.0.1 instead of the real IP, which can be seen on the server page in the admin console. So the real IP won't work in this case anyway.




[MB-11833] Additional documentation on couchbase user and required permissions Created: 28/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Mel Boulos Assignee: Amy Kurtzman
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Our docs don't mention anything about the couchbase user or its permissions. The National Forest Service would like details on which specific permissions the couchbase user needs.

 Comments   
Comment by Anil Kumar [ 28/Jul/14 ]
Assigning to Steve Yen to provide information.
Comment by Steve Yen [ 28/Jul/14 ]
On Linux systems, the couchbase user and couchbase group are created as part of the rpm and dpkg/deb installation packaging, and on those systems, following the usual practices, the couchbase server processes are spawned using that couchbase user/group to limit runtime access.

The permissions that the couchbase user/group need include the ability to execute couchbase executables (e.g., to start the couchbase cluster manager and data nodes), and to create, read, and modify couchbase-related files and directories, including database files and log files.

Comment by Steve Yen [ 28/Jul/14 ]
Amy,
Let me know if any more questions.
Cheers,
Steve




[MB-11832] [System Test] Rebalance + Indexing stuck on Rebalance-In in light DGM setup Created: 28/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Ketaki Gangal
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1021-rel

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
1. Create 2 buckets, 1 ddoc, 2 views each
2. Load 120M, 113M items on respective buckets, dgm 70%
3. Wait for initial indexing to complete
4. Rebalance In 1 node - Rebalance is stuck at about 0%

-- Seeing a few error messages on server timeouts.

upr client (default, mapreduce_view: default _design/ddoc1 (prod/main)): Obtaining mutation from server timed out after 60.0 seconds [RequestId 353650, PartId 7, StartSeq 152559, EndSeq 152824]. Waiting...

-- Attaching logs.

 Comments   
Comment by Parag Agarwal [ 28/Jul/14 ]
We hit another bug with 1035: https://www.couchbase.com/issues/browse/MB-11827, where indexing is stuck during rebalance.
Comment by Ketaki Gangal [ 28/Jul/14 ]
Logs https://s3.amazonaws.com/bugdb/11832/index.tar

and https://s3.amazonaws.com/bugdb/11832/part2.tar
Comment by Mike Wiederhold [ 28/Jul/14 ]
Ketaki,

Did this rebalance eventually complete? I know you guys only check that rebalance progresses for a minute or so, and if it doesn't then you end the test. After looking at the logs I think this is only a temporarily stuck issue. Please confirm.




[MB-11831] a doc gets added but not indexed Created: 28/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: secondary-index
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Milan Simonovic Assignee: Sriram Melkote
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: tested on ubuntu 14.04 64bit

Triage: Untriaged
Flagged:
Impediment
Is this a Regression?: Unknown

 Description   
code to reproduce the bug: https://github.com/mbsimonovic/couchbase-bugs




[MB-11830] {UPR}:: CBTransfer fails {error: could not read _local/vbstate} after Rebalance-in Created: 27/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.6.2.144-10.6.2.147

Triage: Triaged
Is this a Regression?: Unknown

 Description   
1035, Centos 6x

1. Create 3 Node cluster
2. Add a default bucket
3. Load 100 K items
4. Rebalance-in 1 node

Rebalance succeeds, and after all the queues have drained, we compare the replica items loaded initially vs. those present after the rebalance-in.

The test fails since replica items are missing.

Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,skip_cleanup=True,blob_generator=false

This test case passed for 973 build

https://s3.amazonaws.com/bugdb/jira/MB-11830/replica_keys_missing.txt

https://s3.amazonaws.com/bugdb/jira/MB-11830/1035_log_data_log.tar.gz


 Comments   
Comment by Mike Wiederhold [ 28/Jul/14 ]
I checked two of the vbuckets that supposedly had missing keys.

mike12109 -> vbucket 640
mike12108 -> vbucket 391

Mike-Wiederholds-MacBook-Pro:1035_log_data_log.tar mikewied$ cat 10.6.2.145/stats.log | grep '_391:num_items'
 vb_391:num_items_for_persistence: 0
 vb_391:num_items: 103
Mike-Wiederholds-MacBook-Pro:1035_log_data_log.tar mikewied$ cat 10.6.2.144/stats.log | grep '_391:num_items'
 vb_391:num_items_for_persistence: 0
 vb_391:num_items: 103


In both cases there doesn't seem to be any data loss. Please include vbuckets that have mismatched items if there are any. It's also not clear how you found these missing keys.
Comment by Parag Agarwal [ 28/Jul/14 ]
We take snapshots via cbtransfer before and after, and then compare them. If you run the above-mentioned test case using cluster run, it should repro.
Comment by Mike Wiederhold [ 28/Jul/14 ]
Parag,

If you're getting the keys with cbtransfer then the bug might be there. Is there any stats verification done before you run cbtransfer?
Comment by Parag Agarwal [ 28/Jul/14 ]
Mike,

The test reads the couchstore files via cbtransfer. We can check with Bin if any changes were made.

We do check stats such as the expected active and replica item counts and that the queues are drained. After this check we do more detailed data validation.

Bin: Were any changes made that might impact cbtransfer? We are using the 1035 build and seeing inconsistency in replica items when comparing before and after the rebalance-in operation.
Comment by Mike Wiederhold [ 28/Jul/14 ]
In order to debug this further on the server side I would need the data files. If Bin doesn't find anything from looking at the cbtransfer script then please re-run this test and either let me look at the live cluster or attach the data files along with the logs.
Comment by Bin Cui [ 28/Jul/14 ]
I don't think it is related to cbtransfer. If you can get the right replica data before rebalance and some is missing after it, then either ep_engine doesn't provide those missing items, or we miss some items due to a seqno change.

1. What if you run a full backup after rebalance? Do you still have missing items?
2. If it is solely based on incremental backup, maybe the missing data is caused by a wrong seqno or the failover logs? I am not sure.
Comment by Mike Wiederhold [ 28/Jul/14 ]
I've looked at the data files and there don't appear to be any signs of data loss. Even the keys that were reported missing can be found in the data files. We will keep investigating to see why this test is failing, but there doesn't appear to be any data loss on the server at the moment.
Comment by Parag Agarwal [ 28/Jul/14 ]
Bin: We are hitting the following issue after rebalance.

For Replica

Last login: Mon Jul 28 13:31:33 2014 from 10.17.45.173
[root@palm-10307 ~]# /opt/couchbase/bin/cbtransfer couchstore-files:///opt/couchbase/var/lib/couchbase/data csv:/tmp/ab45472c-1695-11e4-9a2f-005056970042.csv -b default -u Administrator -p password --source-vbucket-state=replica
error: could not read _local/vbstate from: /opt/couchbase/var/lib/couchbase/data/default/106.couch.7; exception: Expecting object: line 1 column 81 (char 81)
[root@palm-10307 ~]#

Can you please check, the cluster is live: 10.6.2.144

I was able to repro it
Comment by Parag Agarwal [ 28/Jul/14 ]
Occurs for Active

2014-07-28 13:28:47 | INFO | MainProcess | test_thread | [remote_util.execute_command_raw] running command.raw on 10.6.2.145: /opt/couchbase/bin/cbtransfer couchstore-files:///opt/couchbase/var/lib/couchbase/data csv:/tmp/c71f43ee-1695-11e4-9a2f-005056970042.csv -b default -u Administrator -p password
2014-07-28 13:28:47 | INFO | MainProcess | test_thread | [remote_util.execute_command_raw] command executed successfully
2014-07-28 13:28:47 | INFO | MainProcess | test_thread | [remote_util.log_command_output] error: could not read _local/vbstate from: /opt/couchbase/var/lib/couchbase/data/default/127.couch.8; exception: Expecting object: line 1 column 81 (char 81)
Comment by Bin Cui [ 28/Jul/14 ]
It happens when the tool tries to read the couchstore files using the couchstore API. The code snippet is:

                store = couchstore.CouchStore(f, 'r')
                try:
                    doc_str = store.localDocs['_local/vbstate']
                    if doc_str:
                        doc = json.loads(doc_str)
                        state = doc.get('state', None)
                        if state:
                            vbucket_states[state][vbucket_id] = doc
                        else:
                            return "error: missing vbucket_state from: %s" \
                                % (f), None
                except Exception, e:
                    return ("error: could not read _local/vbstate from: %s" +
                            "; exception: %s") % (f, e), None

Need to figure out why the couchstore API raises an exception in this case.
Comment by Bin Cui [ 28/Jul/14 ]
Parag, can you attach /opt/couchbase/var/lib/couchbase/data/default/106.couch.7 so we can figure out why the exception is thrown?
Comment by Mike Wiederhold [ 28/Jul/14 ]
Sundar,

It's because the failover log is empty. This is likely a regression from the set vbucket state change you made where you cache the failover log.

[root@palm-10307 ~]# cat /opt/couchbase/var/lib/couchbase/data/default/106.couch.7
?)+k
    ?cr?fT?R_local/vbstate{"\": "dead","checkpoint_id\0","max_deleted_seqno": Dfailover_table": }???)????

                                                                                                         "k[root@palm-10307 ~]#
[root@palm-10307 ~]#
Comment by Sundar Sridharan [ 28/Jul/14 ]
Thanks Mike. Looks like the case of a table with no entries needs to be handled too.
http://review.couchbase.org/39967
Comment by Sundar Sridharan [ 28/Jul/14 ]
Fix was merged. Parag, can you please verify the change?




[MB-11828] {UPR} :: Rebalance exit after Hard failover+addback Created: 27/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Sarath Lakshman
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.6.2.144-147

Triage: Untriaged
Is this a Regression?: Yes

 Description   
1033, centos 6x

1. Create a 4 node cluster
2. Create a default bucket with 1 K items
3. Create 3 views and query them
4. Hard Failover a node and then add-back
5. Rebalance the cluster

Rebalance exits with the following exception

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.1262.4>,
{bulk_set_vbucket_state_failed,
[{'ns_1@10.6.2.147',
{'EXIT',
{{{{badmatch,
{error,
{error,
<<"Partition 123 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3,
[{file,
"src/capi_set_view_manager.erl"},
{line,218}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,585}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},{line,304}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]},
{gen_server,call,
['capi_set_view_manager-default',
{wait_index_updated,123},
infinity]}},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@10.6.2.147'},
{if_rebalance,<0.997.4>,
{update_vbucket_state,26,replica,
undefined,'ns_1@10.6.2.144'}},
infinity]}}}}]}}}

Test Case :: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,skip_cleanup=True,blob_generator=false



 Comments   
Comment by Parag Agarwal [ 27/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11828/1033_log.tar.gz
Comment by Parag Agarwal [ 27/Jul/14 ]
Errors observed in the couchdb log:

[couchdb:error,2014-07-26T18:47:27.203,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.29753.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.249,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.29835.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.283,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.29905.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.327,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.29946.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.365,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.29975.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.518,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.30127.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.557,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.30162.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.595,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.30193.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.637,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.30247.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.672,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.30302.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T18:47:27.711,ns_1@10.6.2.144:<0.28901.1>:couch_log:error:44]Cleanup process <0.30323.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped
[couchdb:error,2014-07-26T21:45:12.200,ns_1@10.6.2.144:<0.28899.1>:couch_log:error:44]upr client (default, mapreduce_view: default _design/default_view (prod/main)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-26T21:45:12.201,ns_1@10.6.2.144:<0.28909.1>:couch_log:error:44]upr client (default, mapreduce_view: default _design/default_view (prod/replica)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-26T21:45:17.282,ns_1@10.6.2.144:<0.29600.1>:couch_log:error:44]Set view `default`, main group `_design/default_view`, doc loader error
error: {timeout,{gen_server,call,[<0.28899.1>,{add_stream,105,0,0,7,4}]}}
[couchdb:error,2014-07-26T21:45:17.284,ns_1@10.6.2.144:<0.28890.1>:couch_log:error:44]Set view `default`, main (prod) group `_design/default_view`, received error from updater: {timeout,
[couchdb:info,2014-07-26T21:45:17.284,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.122651> update monitor, reference 53, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.284,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121656> update monitor, reference 103, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.285,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121816> update monitor, reference 101, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.285,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121848> update monitor, reference 105, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.285,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121590> update monitor, reference 102, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.288,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121936> update monitor, reference 104, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.288,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121569> update monitor, reference 98, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.289,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.122744> update monitor, reference 96, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.289,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121674> update monitor, reference 99, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.290,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.122703> update monitor, reference 97, error: {updater_error,
[couchdb:info,2014-07-26T21:45:17.290,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121971> update monitor, reference 100, error: {updater_error,
                                                                                                                                        {error,
                                                                                                                                         {error,
[couchdb:info,2014-07-26T21:45:18.705,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.122651> update monitor, reference 53, error: {shutdown,
                                                                                                                                                {error,
                                                                                                                                                 {error,
[couchdb:info,2014-07-26T21:45:18.706,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121656> update monitor, reference 103, error: {shutdown,
                                                                                                                                                 {error,
                                                                                                                                                  {error,
[couchdb:info,2014-07-26T21:45:18.707,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121816> update monitor, reference 101, error: {shutdown,
                                                                                                                                                 {error,
                                                                                                                                                  {error,
[couchdb:info,2014-07-26T21:45:18.708,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121848> update monitor, reference 105, error: {shutdown,
                                                                                                                                                 {error,
                                                                                                                                                  {error,
[couchdb:info,2014-07-26T21:45:18.709,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121590> update monitor, reference 102, error: {shutdown,
                                                                                                                                                 {error,
                                                                                                                                                  {error,
[couchdb:info,2014-07-26T21:45:18.710,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121936> update monitor, reference 104, error: {shutdown,
                                                                                                                                                 {error,
                                                                                                                                                  {error,
[couchdb:info,2014-07-26T21:45:18.710,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121569> update monitor, reference 98, error: {shutdown,
                                                                                                                                                {error,
                                                                                                                                                 {error,
[couchdb:info,2014-07-26T21:45:18.711,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.122744> update monitor, reference 96, error: {shutdown,
                                                                                                                                                {error,
                                                                                                                                                 {error,
[couchdb:info,2014-07-26T21:45:18.712,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121674> update monitor, reference 99, error: {shutdown,
                                                                                                                                                {error,
                                                                                                                                                 {error,
[couchdb:info,2014-07-26T21:45:18.712,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.122703> update monitor, reference 97, error: {shutdown,
                                                                                                                                                {error,
                                                                                                                                                 {error,
[couchdb:info,2014-07-26T21:45:18.713,ns_1@10.6.2.144:<0.28890.1>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, replying to partition #Ref<0.0.2.121971> update monitor, reference 100, error: {shutdown,
                                                                                                                                                 {error,
                                                                                                                                                  {error,
[couchdb:error,2014-07-26T21:50:10.926,ns_1@10.6.2.144:<0.30362.3>:couch_log:error:44]upr client (default, mapreduce_view: default _design/default_view (prod/replica)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[couchdb:error,2014-07-26T21:50:10.927,ns_1@10.6.2.144:<0.30343.3>:couch_log:error:44]upr client (default, mapreduce_view: default _design/default_view (prod/main)): upr receive worker failed due to reason: closed. Restarting upr receive worker...
[root@palm-10307 logs]#

Comment by Aleksey Kondratenko [ 28/Jul/14 ]
We've failed to add vbucket 123 because of this:

[ns_server:error,2014-07-26T21:54:25.393,ns_1@10.6.2.147:capi_set_view_manager-default<0.7528.5>:capi_set_view_manager:do_apply_vbucket_states:131]Failed to apply index states for the following ddocs:
[{<<"_design/default_view">>,
  {'EXIT',
      {{upr_died,
           {{badmatch,
                {ok,{state,#Port<0.30189>,5000,8,
                        {dict,1,16,16,8,80,48,
                            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                            {{[],[],
                              [[6|{{add_stream,127},nil}]],
                              [],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                        {dict,0,16,16,8,80,48,
                            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                            {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                              []}}},
                        [],<0.7753.5>,20971520,44,
                        {dict,1,16,16,8,80,48,
                            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                            {{[],[],
                              [[6|{stream_info,127,0,0,10,{0,0},6}]],
                              [],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                        [<<"mapreduce_view: default _design/default_view (prod/main)">>,
                         <<"default">>,<<"_admin">>,
                         <<"d6bbed2a2a21e37c36bf93be3182beaf">>,20971520]}}},
            [{couch_upr_client,handle_info,2,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_upr/src/couch_upr_client.erl"},
                  {line,531}]},
             {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
             {proc_lib,init_p_do_apply,3,
                 [{file,"proc_lib.erl"},{line,239}]}]}},
       {gen_server,call,
           [<0.7561.5>,
            {mark_as_indexable,
                [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
                 23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,
                 43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,
                 63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,
                 83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,
                 102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,
                 117,118,119,120,121,122,123,124,125,126,127]},
            infinity]}}}}]


Why we didn't fail right there instead of failing later is another story.
Comment by Sarath Lakshman [ 29/Jul/14 ]
Fixes for review:
http://review.couchbase.org/#/c/39984/
http://review.couchbase.org/#/c/39988/




[MB-11827] {UPR} :: Rebalance stuck with rebalance-out due to indexing stuck Created: 26/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.6.2.144-10.6.2.147

Triage: Untriaged
Is this a Regression?: Yes

 Description   
1033, centos 6x

1. Create 4 node cluster
2. Create default bucket
3. Add 1 K items
4. Create 3 views and start querying
5. Rebalance-out 1 node

Step 4 and Step 5 act in parallel

Rebalance is stuck

Looked at the couchdb log and found the following error across different machines in the cluster:

[couchdb:error,2014-07-26T18:47:27.450,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16158.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.557,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16199.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.745,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16230.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.783,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16240.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.831,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16250.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

[couchdb:error,2014-07-26T18:47:27.877,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16260.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

TEST CASE ::
./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=False,force_kill_memached=False,verify_unacked_bytes=True -t rebalance.rebalance_progress.RebalanceProgressTests.test_progress_rebalance_out,nodes_init=4,nodes_out=1,GROUP=P0,skip_cleanup=True,blob_generator=false

 Comments   
Comment by Parag Agarwal [ 26/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11827/1033log.tar.gz

Comment by Parag Agarwal [ 26/Jul/14 ]
Failing test case: http://qa.hq.northscale.net/view/3.0.0/job/centos_x64--02_05--Rebalance_Progress/

Check the first 6; rebalance hangs.
Comment by Aleksey Kondratenko [ 26/Jul/14 ]
Indeed we're waiting for indexes.
Comment by Sarath Lakshman [ 28/Jul/14 ]
This looks like a duplicate of the recently filed bug MB-11786.


[couchdb:info,2014-07-26T18:56:22.325,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
[couchdb:info,2014-07-26T18:56:22.427,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
[couchdb:info,2014-07-26T18:56:22.528,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
[couchdb:info,2014-07-26T18:56:22.629,ns_1@10.6.2.144:<0.29600.1>:couch_log:info:41]upr client (<0.28899.1>): Temporary failure on stream request on partition 105. Retrying...
Comment by Sarath Lakshman [ 28/Jul/14 ]
FYI,
[couchdb:error,2014-07-26T18:47:27.783,ns_1@10.6.2.146:<0.15845.1>:couch_log:error:44]Cleanup process <0.16240.1> for set view `default`, replica (prod) group `_design/default_view`, died with reason: stopped

This is a harmless message. I am planning to reduce its log level; I will do this as part of the final round of cleanup.
Comment by Mike Wiederhold [ 28/Jul/14 ]
http://review.couchbase.org/#/c/39960/




[MB-11826] Don't send unnecessary config updates. Created: 26/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Brett Lawson Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
The memcached service should keep track of the last revId that it sent to each connected client, and avoid dispatching an identical config to the client during an NMV. Currently, when pipelining operations, a client may receive hundreds of NMV responses when only a single one is necessary. This won't prevent multiple NMVs across nodes, but it will prevent the bulk of the spam.

Note: An explicit request for the config via CCCP_GET_VBUCKET_CONFIG should always return the data regardless of revId info.
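A minimal sketch of the idea with hypothetical names (Connection, lastSentConfigRev, and shouldAttachConfig are illustrative, not the actual memcached code): remember the revision of the config last pushed on each connection and attach a config to an NMV response only when that revision has changed, while an explicit config request always returns the payload.

#include <cstdint>

struct Connection {
    int64_t lastSentConfigRev = -1;   // hypothetical per-connection field
};

// Decide whether to attach the cluster config to a response.
bool shouldAttachConfig(Connection &conn, int64_t currentRev,
                        bool explicitConfigRequest) {
    // An explicit CCCP_GET_VBUCKET_CONFIG request must always return the config.
    if (explicitConfigRequest || currentRev != conn.lastSentConfigRev) {
        conn.lastSentConfigRev = currentRev;
        return true;
    }
    return false;   // identical config was already sent on this connection
}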

 Comments   
Comment by Jeff Morris [ 28/Jul/14 ]
This would alleviate the "spam" problem with configs during rebalance/swap/failover scenarios. +1.




[MB-11825] Rebalance may fail if cluster_compat_mode:is_node_compatible times out waiting for ns_doctor:get_node Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: customer, rebalance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Is this a Regression?: No

 Description   
Saw this in CBSE-1301:

 <0.2025.3344> exited with {{function_clause,
                            [{new_ns_replicas_builder,handle_info,
                              [{#Ref<0.0.4447.107509>,
                                [stale,
                                 {last_heard,{1406,315410,842219}},
                                 {now,{1406,315410,210848}},
                                 {active_buckets,
                                  ["user_reg","sentence","telemetry","common",
                                   "notifications","playlists","users"]},
                                 {ready_buckets,

which caused rebalance to fail.

The reason is that new_ns_replicas_builder doesn't have the catch-all handle_info that is typical for gen_servers. This message occurs because of the following call chain:

* new_ns_replicas_builder:init/1

* ns_replicas_builder_utils:spawn_replica_builder/5

* ebucketmigrator_srv:build_args

* cluster_compat_mode:is_node_compatible

* ns_doctor:get_node

ns_doctor:get_node handles the timeout and returns an empty list. So if this happens, the actual reply may be delivered later and get handled by handle_info, which in this case is unable to do so.

3.0 is mostly immune to this particular chain of calls due to this optimization:

commit 70badff90b03176b357cac4d03e40acc62f4861b
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Tue Oct 1 11:44:02 2013 -0700

    MB-9096: optimized is_node_compatible when cluster is compatible
    
    There's no need to check for particular node's compatibility with
    certain feature if entire cluster's mode is new enough.
    
    Change-Id: I9573e6b2049cb00d2adad709ba41ec5285d66a6b
    Reviewed-on: http://review.couchbase.org/29317
    Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
    Reviewed-by: Artem Stemkovski <artem@couchbase.com>


 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
http://review.couchbase.org/39908




[MB-11824] [system test] [kv unix] rebalance hang at 0% when add a node to cluster Created: 25/Jul/14  Updated: 28/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.4 64-bit

Attachments: Zip Archive 172.23.107.195-7252014-1342-diag.zip     Zip Archive 172.23.107.196-7252014-1345-diag.zip     Zip Archive 172.23.107.197-7252014-1349-diag.zip     Zip Archive 172.23.107.199-7252014-1352-diag.zip     Zip Archive 172.23.107.200-7252014-1356-diag.zip     Zip Archive 172.23.107.201-7252014-143-diag.zip     Zip Archive 172.23.107.202-7252014-1359-diag.zip     Zip Archive 172.23.107.203-7252014-146-diag.zip    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on 8 nodes
1:172.23.107.195
2:172.23.107.196
3:172.23.107.197
4:172.23.107.199
5:172.23.107.200
6:172.23.107.202
7:172.23.107.201

8:172.23.107.203

Create a cluster of 7 nodes
Create 2 buckets: default and sasl-2 (no view)
Load 25+ M items into each bucket to bring the active resident ratio down to 80%.
Do updates, expirations, and deletes on both buckets for 3 hours.
Then add node 203 to the cluster. Rebalance hangs at 0%.

Live cluster is available to debug


 Comments   
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809. We currently have two bug fixes in that address rebalance stuck issues (MB-11809 and MB-11786). Please run the tests with these changes merged before filing any other rebalance stuck issues.
Comment by Thuan Nguyen [ 28/Jul/14 ]
I could not repro this bug in build 3.0.0-1031 in the kv-only system test.




[MB-11822] numWorkers setting of 5 is treated as high priority but should be treated as low priority. Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Sundar Sridharan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
https://github.com/couchbase/ep-engine/blob/master/src/workload.h#L44-48
We currently use the priority conversion formula seen in the above code snippet.
This assigns a numWorkers setting of 5 high priority, but the expectation is that <= 5 is low priority.
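A minimal sketch of the boundary change (identifiers are illustrative, not the actual workload.h code): a numWorkers value of 5 or less should map to low priority, so only values strictly greater than 5 should be treated as high priority.

enum class BucketPriority { Low, High };

// Hypothetical conversion: <= 5 workers is low priority, > 5 is high.
BucketPriority priorityFromNumWorkers(int numWorkers) {
    return (numWorkers > 5) ? BucketPriority::High : BucketPriority::Low;
}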

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix uploaded for review at http://review.couchbase.org/39891. Thanks.




[MB-11821] Rename UPR to DCP in stats and loggings Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sundar Sridharan Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
ep-engine side changes are http://review.couchbase.org/#/c/39898/ thanks




[MB-11820] beer-sample loading is stuck in crashed state (was: Rebalance not available 'pending add rebalance', beer-sample loading is stuck) Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Anil Kumar Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
4 node cluster
scenario

1 - Created beer-sample right after creation of the cluster
2 - Right after bucket started loading, auto-generated load started running on it
3 - After many many minutes, I added a few nodes and noticed that I couldn't rebalance. Digging in further, I saw that the beer-sample loading was still going on but not making any progress.

Logs are at:
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.196.74.148.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.196.87.131.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.0.243.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.21.69.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.22.57.zip

 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Converting this ticket to "beer-sample loading is stuck". The lack-of-rebalance warning is tracked by another existing ticket that is still in the works.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Here's what I have in my logs that's output from docloader:

5 matches for "output from beer-sample" in buffer: ns_server.debug.log
  19416:[ns_server:debug,2014-07-24T23:40:07.637,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "[2014-07-24 23:40:07,637] - [rest_client] [47987464387312] - INFO - existing buckets : [u'beer-sample']\n"
  19417:[ns_server:debug,2014-07-24T23:40:07.637,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "[2014-07-24 23:40:07,637] - [rest_client] [47987464387312] - INFO - found bucket beer-sample\n"
  19450:[ns_server:debug,2014-07-24T23:40:10.387,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "Traceback (most recent call last):\n File \"/opt/couchbase/lib/python/cbdocloader\", line 241, in ?\n main()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 233, in main\n"
  19451:[ns_server:debug,2014-07-24T23:40:10.388,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: " docloader.populate_docs()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 191, in populate_docs\n self.unzip_file_and_upload()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 175, in unzip_file_and_upload\n self.enumerate_and_save(working_dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 165, in enumerate_and_save\n self.enumerate_and_save(dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 165, in enumerate_and_save\n self.enumerate_and_save(dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 155, in enumerate_and_save\n self.save_doc(dockey, fp)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 133, in save_doc\n self.bucket.set(dockey, 0, 0, raw_data)\n File \"/opt/couchbase/lib/python/couchbase/client.py\", line 232, in set\n self.mc_client.set(key, expiration, flags, value)\n File \"/opt/couchbase/lib/python/couchbase/couchbaseclient.py\", line 927, in set\n"
  19452:[ns_server:debug,2014-07-24T23:40:10.388,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: " return self._respond(item, event)\n File \"/opt/couchbase/lib/python/couchbase/couchbaseclient.py\", line 883, in _respond\n raise item[\"response\"][\"error\"]\ncouchbase.couchbaseclient.MemcachedError: Memcached error #134: Temporary failure\n"

I don't know if docloader is truly stuck or if it is retrying and getting temporary errors the whole time.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
I don't know who owns docloader, but AFAIK it was Bin. I've also heard about some attempts to rewrite it in Go.

CC-ed a bunch of possibly related folks.




[MB-11819] XDCR: Rebalance at destination hangs, missing replica items Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Duplicate Votes: 0
Labels: rebalance-hang
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 172.23.106.45-7242014-208-diag.zip     Zip Archive 172.23.106.46-7242014-2010-diag.zip     Zip Archive 172.23.106.47-7242014-2011-diag.zip     Zip Archive 172.23.106.48-7242014-2013-diag.zip    
Issue Links:
Duplicate
duplicates MB-11809 {UPR}:: Rebalance-in of 2 nodes is st... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-1014

Scenario
------------
1. Uni-xdcr between 2-node clusters, default bucket
2. Load 30K items on source
3. Pause XDCR
4. Start "rebalance-out" of one node each from both clusters simultaneously.
5. Resume xdcr

Rebalance at the source proceeds to completion, but rebalance on the destination hangs at 10%; see below:

',default bucket
[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at cluster 172.23.106.47
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:07,325] - [rest_client:1216] INFO - rebalance percentage : 100 %
[2014-07-24 13:28:30,222] - [task:411] INFO - rebalancing was completed with progress: 100% in 83.475001812 sec
[2014-07-24 13:28:30,223] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:28:30,229] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:40,252] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:50,280] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:00,301] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:10,342] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:20,363] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:30,389] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:40,410] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:50,437] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:00,458] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:10,480] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:20,504] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:30,523] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:40,546] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:50,569] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %

Testcase
--------------
./testrunner -i uni-xdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,items=30000,rdirection=unidirection,ctopology=chain,replication_type=xmem,rebalance_out=source-destination,pause=source,GROUP=P1


Could the rebalance hang explain the missing replica items?

[2014-07-24 13:31:49,079] - [task:463] INFO - Saw curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,103] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,343] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:49,536] - [task:463] INFO - Saw vb_active_curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,559] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,811] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:50,001] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:31:55,045] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:00,080] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:05,113] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

Logs
-------------
will attach cbcollect with xdcr trace logging.

 Comments   
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Do you have _any reason at all_ to believe that it's even remotely related to xdcr? Specifically, xdcr does nothing about upr replicas.
Comment by Aruna Piravi [ 24/Jul/14 ]
I, of course _do_ know that replicas have nothing to do with xdcr. But I'm unsure whether xdcr and the parallel rebalance contributed to the hang.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
I cannot diagnose a stuck rebalance when logs are captured after cleanup.
Comment by Aruna Piravi [ 24/Jul/14 ]
And more on why I think so ---

Pls note from logs below that there has been no progress in rebalance at the destination _from_ the time we resumed xdcr. Until then it had progressed to 10%.

[2014-07-24 13:26:59,500] - [pauseResumeXDCR:92] INFO - ##### Pausing xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:26:59,541] - [rest_client:1757] INFO - Updated pauseRequested=true on bucket'default' on 172.23.106.45
[2014-07-24 13:26:59,968] - [task:517] WARNING - Not Ready: xdc_ops 1734 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:27:00,145] - [task:521] INFO - Saw replication_docs_rep_queue 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:00,339] - [task:517] WARNING - Not Ready: replication_active_vbreps 16 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091', default bucket
[2014-07-24 13:27:05,490] - [task:521] INFO - Saw xdc_ops 0 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:27:05,697] - [task:521] INFO - Saw replication_active_vbreps 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at source cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at source cluster 172.23.106.47
[2014-07-24 13:27:05,761] - [xdcrbasetests:372] INFO - sleep for 5 secs. ...
[2014-07-24 13:27:06,733] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 13:27:06,746] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,773] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 13:27:06,796] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:10,823] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:27:10,860] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 13:27:11,101] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 894
[2014-07-24 13:27:11,102] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:12,043] - [task:521] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 13:27:12,260] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 869
[2014-07-24 13:27:12,261] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:13,142] - [task:521] INFO - Saw xdc_ops 4770 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
Comment by Aruna Piravi [ 24/Jul/14 ]
Live cluster

http://172.23.106.45:8091/
http://172.23.106.47:8091/ <-- rebalance stuck
Comment by Aruna Piravi [ 24/Jul/14 ]
New logs attached.
Comment by Aruna Piravi [ 24/Jul/14 ]
Didn't try pausing replication from the source cluster. Wanted to leave the cluster in the same state.

.47 started receiving data through resumed xdcr from 20:04:01. The last recorded rebalance progress was 8.7890625 % at 20:04:05 on .47. Could have stopped a few secs before that.

[2014-07-24 20:03:55,538] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 20:03:55,547] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,569] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 20:03:55,578] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,584] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:55,592] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:59,629] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 20:03:59,665] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 20:03:59,799] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1010
[2014-07-24 20:03:59,800] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:00,803] - [task:523] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 20:04:01,019] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1082
[2014-07-24 20:04:01,020] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:01,877] - [task:523] INFO - Saw xdc_ops 4981 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 20:04:01,888] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 20:04:05,894] - [rest_client:1216] INFO - rebalance percentage : 10.7421875 %
[2014-07-24 20:04:05,905] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:15,927] - [rest_client:1216] INFO - rebalance percentage : 19.53125 %
[2014-07-24 20:04:15,937] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:25,956] - [rest_client:1216] INFO - rebalance percentage : 26.7578125 %
[2014-07-24 20:04:25,964] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:35,995] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 20:04:36,007] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:46,030] - [rest_client:1216] INFO - rebalance percentage : 50.9114583333 %
[2014-07-24 20:04:46,037] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:56,060] - [rest_client:1216] INFO - rebalance percentage : 59.7005208333 %
[2014-07-24 20:04:56,068] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:05:06,087] - [rest_client:1216] INFO - rebalance percentage : 99.9348958333 %
[2014-07-24 20:05:06,096] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Same symptoms as MB-11809:

     {<0.4446.17>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007fdb6c22ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007fdb1022d3a8 Return addr 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.12.179156>">>,<<"y(1) infinity">>,
                   <<"y(2) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.147.17>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007fdb1022d3e0 Return addr 0x00007fdb1b1ed020 (janitor_agent:'-spawn_rebalance_subprocess/3-fun-0-'/3 + 200)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) Catch 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007fdb1022d408 Return addr 0x00007fdb6c2338a0 (proc_lib:init_p/3 + 688)">>,
                   <<"y(0) <0.160.17>">>,<<>>,
                   <<"0x00007fdb1022d418 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007fdb6c2338c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,0}]},
       {heap_size,233},
       {total_heap_size,233},
       {links,[<0.160.17>,<0.186.17>]},
       {memory,2816},
       {message_queue_len,0},
       {reductions,29},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Aruna, consider pausing xdcr. It is likely unrelated to xdcr given the MB- reference above.
Comment by Aruna Piravi [ 25/Jul/14 ]
I paused xdcr last night. No progress on the rebalance yet. Does that rule out xdcr completely?
Comment by Aruna Piravi [ 25/Jul/14 ]
Raising as test blocker. ~10 tests failed due to this rebalance hang problem. Feel free to close if found to be a duplicate of MB-11809.
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809




[MB-11818] couchbase-cli cluster-wide collectinfo failed to start collection on selected nodes Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-1022 on 4 nodes
Run couchbase cli to do cluster-wide collectinfo on one node
The collection failed to start

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149


 Comments   
Comment by Bin Cui [ 25/Jul/14 ]
I am confused. Are you sure you want to use collect-logs-stop to start collecting?
Comment by Thuan Nguyen [ 25/Jul/14 ]
Oops, I copied the wrong command.
Here is the command that failed to start collectinfo:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
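
For reference, the "global name 'nodes' is not defined" message is an ordinary Python NameError: the collect-logs-start code path references a module-level name that is never assigned. Below is a minimal, hypothetical illustration of that failure mode; the function and argument names are made up and this is not the real couchbase-cli source.

# Hypothetical illustration only, not the actual couchbase-cli code.
def collect_logs_start(server, node_list):
    # The option parser was supposed to bind a module-level 'nodes' variable
    # from --nodes=..., but nothing ever assigns it, so referencing it fails.
    return {"server": server, "nodes": nodes}

try:
    collect_logs_start("192.168.171.148:8091", "ns_1@192.168.171.149")
except NameError as err:
    print("ERROR: %s" % err)   # under Python 2: global name 'nodes' is not defined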
Comment by Bin Cui [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39889/




[MB-11817] cluster-wide cli does not print out success or failure when starting log collection Created: 24/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-1022 on one node
Run root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --allnodes
root@ubuntu:~#
Collection start shows in the UI, but the command line prints nothing. I don't know whether it succeeded or failed.
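
A minimal sketch of the kind of result reporting the command could print instead of exiting silently; this is a hypothetical helper, not the actual fix referenced in the review below.

# Hypothetical sketch only; the real change is in the linked review.
import sys

def report_collect_start(http_status, error_text=""):
    # Print an explicit verdict so the operator does not have to guess.
    if http_status in (200, 202):
        print("SUCCESS: log collection started")
        return 0
    print("ERROR: log collection failed to start: %s" % error_text)
    return 1

if __name__ == "__main__":
    sys.exit(report_collect_start(200))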


 Comments   
Comment by Bin Cui [ 28/Jul/14 ]
http://review.couchbase.org/#/c/39962/




[MB-11816] couchbase-cli failed to collect logs in cluster-wide collection Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.deb.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on one ubuntu 12.04 node
Run cluster-wide collectinfo using couchbase-cli
Failed to collect

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c localhost:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39848/
Comment by Thuan Nguyen [ 25/Jul/14 ]
Verified on build 3.0.0-1028. This bug was fixed.




[MB-11815] Support Ubuntu 14.04 as supported platform Created: 24/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Anil Kumar Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We need to add support for Ubuntu 14.04.




[MB-11814] Failover decision at bucket level Created: 24/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Parag Agarwal Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Scenario Example

A cluster of 4 healthy nodes has two buckets, one with replica=0 and one with replica=1. When failing over a node we only get the hard failover option, since we take the minimum replica count across all buckets to make this choice. In order to offer graceful failover, we would require at least replica=1 across all buckets.

This can be improved by letting the system decide whether graceful failover of a particular bucket is possible based on that bucket's own replica count, rather than on another bucket with a lower replica count (which qualifies for hard failover only). Since graceful failover avoids data loss compared to hard failover, this would reduce data-loss situations.
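
A minimal sketch of the two decision rules (the current cluster-wide minimum versus the proposed per-bucket check); bucket names and replica counts here are illustrative only, not ns_server code.

# Illustrative only: one bucket with no replicas, one with a single replica.
buckets = {"bucket_a": 0, "bucket_b": 1}   # bucket name -> replica count

def graceful_allowed_today(buckets):
    # Current rule: the minimum replica count across all buckets decides,
    # so a single replica=0 bucket forces hard failover for everything.
    return min(buckets.values()) >= 1

def graceful_allowed_per_bucket(buckets):
    # Proposed rule: decide per bucket, so bucket_b could still be failed
    # over gracefully while bucket_a falls back to hard failover.
    return dict((name, count >= 1) for name, count in buckets.items())

print(graceful_allowed_today(buckets))       # False
print(graceful_allowed_per_bucket(buckets))  # {'bucket_a': False, 'bucket_b': True}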



 Comments   
Comment by Dave Rigby [ 25/Jul/14 ]
Similarly for auto-failover: currently, if you have a bucket with zero replicas it essentially "blocks" auto-failover of another bucket which does have replicas.




[MB-11813] windows 64-bit buildbot failed to build new 64-bit builds and reported no error Created: 24/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
Windows 64-bit buildbot failed to produce a new build
http://builds.hq.northscale.net:8010/builders/server-30-win-x64-300/builds/411
No errors were reported.

 Comments   
Comment by Thuan Nguyen [ 24/Jul/14 ]
This 64-bit builder shows the build as successful, but no build was actually produced.
Comment by Chris Hillery [ 24/Jul/14 ]
The build isn't performed by buildbot; buildbot only spawns the Jenkins job:

http://factory.couchbase.com/job/cs_300_win6408/

And that job is still ongoing.




[MB-11812] Need a read-only mode to startup the query server Created: 24/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: query
Affects Version/s: cbq-DP4
Fix Version/s: cbq-DP4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Don Pinto Assignee: Gerald Sangudi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
This is required for the tutorial in production, as we don't want any user to blow away the data or add additional data.

All DML queries should be blocked when the server is started in this mode. Only the admin should be able to start the query server in read-only mode.
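
A minimal sketch of the statement filtering a read-only mode implies; the keyword list and helper function are assumptions for illustration, not the actual query server API.

# Illustrative sketch only; the real read-only mode belongs in the query server.
DML_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "UPSERT", "MERGE")

class ReadOnlyError(Exception):
    pass

def check_statement(statement, read_only=True):
    # Reject anything that starts with a DML keyword while read-only mode is on.
    first_word = statement.strip().split(None, 1)[0].upper()
    if read_only and first_word in DML_KEYWORDS:
        raise ReadOnlyError("DML blocked: query server started in read-only mode")

check_statement("SELECT name FROM tutorial WHERE age > 21")   # allowed
try:
    check_statement("DELETE FROM tutorial WHERE age > 21")
except ReadOnlyError as err:
    print("blocked: %s" % err)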






[MB-11811] [Tools] Change UPR to DCP for tools Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39814/




[MB-11810] No feedback (similar to rebalance failures) if installation of sample buckets fails... Created: 24/Jul/14  Updated: 24/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Improvement Priority: Major
Reporter: Trond Norbye Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
I tried to install the sample buckets on my windows server, and the buckets are created but there is no data in them when I look at them. I would have expected a "red error stripe", just like I get with, for instance, rebalance errors, so that I know that something went wrong.

I took a look in the "log" section and saw what my mum would call a cryptic error message:

Loading sample bucket gamesim-sample failed: {failed_to_load_samples_with_status,
1}

When I tried to run the program cbdocloader with the appropriate arguments I get a more informative error message:

"Failed to listen listen unix /tmp/log_upr_client.sock: Det ble brukt en adresse som var inkompatibel med den forespurte protokollen"

(I'm using my Go version of cbdocloader, which someone just modified to use the retriever logging project, which uses unix sockets, and those don't work on windows).

It would be nice if we could add the output from the process to the log. It would make it easier to debug the problem for customers (they will probably not know the name (and the arguments) of the binary we tried to use).


 Comments   
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Deserves proper design.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Also please note that we don't capture output of samples loader because:

a) erlang doesn't allow us to read stdout and stderr separately

b) original docloader is quite noisy

Things might indeed change once we have a better loader implementation that can be told to output only errors.




[MB-11809] {UPR}:: Rebalance-in of 2 nodes is stuck when doing Ops Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by MB-11819 XDCR: Rebalance at destination hangs,... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
1014, centos 6x

Vms:: 10.6.2.144-150

1. Create 7 node cluster
2. Create default bucket
3. Add 400 K items
4. Do mutations and rebalance-out 2 nodes
5. Do mutations and rebalance-in 2 nodes

Step 5 leads to rebalance being stuck

Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceinout.RebalanceInOutTests.incremental_rebalance_out_in_with_mutation,init_num_nodes=3,items=400000,skip_cleanup=True,GROUP=IN_OUT;P0


 Comments   
Comment by Parag Agarwal [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11809/log.tar.gz
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Takeover request appears to be stuck. That's on node .147.

     {<19779.11046.0>,
      [{registered_name,'replication_manager-default'},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f1b1d12ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f1ad3083860 Return addr 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.0.169038>">>,<<"y(1) infinity">>,
                   <<"y(2) {takeover,78}">>,<<"y(3) '$gen_call'">>,
                   <<"y(4) <0.11353.0>">>,<<"y(5) []">>,<<>>,
                   <<"0x00007f1ad3083898 Return addr 0x00007f1acbd79e70 (replication_manager:handle_call/3 + 2840)">>,
                   <<"y(0) infinity">>,<<"y(1) {takeover,78}">>,
                   <<"y(2) 'upr_replicator-default-ns_1@10.6.2.146'">>,
                   <<"y(3) Catch 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f1ad30838c0 Return addr 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<"y(0) [{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FM\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.149',\"789\"}]">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<>>,
                   <<"0x00007f1ad30838d8 Return addr 0x00007f1b1d133ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) replication_manager">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) <0.11029.0>">>,
                   <<"y(4) {dcp_takeover,'ns_1@10.6.2.146',78}">>,
                   <<"y(5) {<0.11528.0>,#Ref<0.0.0.169027>}">>,
                   <<"y(6) Catch 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<>>,
                   <<"0x00007f1ad3083918 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f1b1d133ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,42}]},
       {heap_size,610},
       {total_heap_size,2208},
       {links,[<19779.11029.0>]},
       {memory,18856},
       {message_queue_len,2},
       {reductions,17287},
       {trap_exit,true}]}
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39894




[MB-11808] GeoSpatial in 3.0 Created: 24/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, UI, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Volker Mische
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We must hide GeoSpatial-related UI elements in the 3.0 release, as we have not completed the task of moving GeoSpatial features over to UPR.

We should use the simplest way to hide the elements (like the "display:none" attribute) because we fully expect to resurface this in 3.0.1.


 Comments   
Comment by Sriram Melkote [ 24/Jul/14 ]
In the 3.0 release meeting, it was fairly clear that we won't be able to add Geo support for 3.0 due to the release being in Beta phase now and heading to code freeze soon. So, we should plan for it in 3.0.1 - updating description to reflect this.




[MB-11807] couchbase server failed to start on ubuntu when upgrading from 2.0 to 3.0 if it could not find the database Created: 23/Jul/14  Updated: 28/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 2.0 in one ubuntu 12.04 64-bit
Initialize it with custom data and index path (/tmp/data and /tmp/index)
Create default bucket
Load 1K items to bucket
Shutdown couchbase server
Remove all files under /tmp/data/ and /tmp/index
Upgrade couchbase server to 3.0.0-995
Couchbase server failed to start because it could not find the database files.
Manually start couchbase server. Couchbase server starts normally with no items in bucket as expected.

The point here is that couchbase server should start even if it cannot find the database files.

It may relate to bug MB-7705

 Comments   
Comment by Bin Cui [ 28/Jul/14 ]
First, I really don't think this is a valid test case. If the data directory and index directory are gone, config.dat becomes obsolete and the upgrade script won't be able to identify the old directories and retrieve host information for the upgrade to proceed. Logically, it really doesn't matter whether you cannot finish the upgrade and start as a brand new node, or you simply fail the upgrade process. Because of the data loss, it will be equivalent to installing a new setup, i.e. a node failover.
Comment by Thuan Nguyen [ 28/Jul/14 ]
In this case, it does not matter whether data is on the node or not, or whether it is an upgrade or not: couchbase server does not start after the installation is done.
Comment by Anil Kumar [ 28/Jul/14 ]
Bin - Discussed with Tony. Looks like we have an inconsistency in the way this particular scenario works on each platform. In this scenario, on CentOS Couchbase Server starts back up with an error message that it cannot find the data files, whereas on Ubuntu Couchbase Server crashes and doesn't start.





[MB-11806] rebalance should not be allowed when cbrecovery is stopped by REST API or has not completed Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Ashvinder Singh Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: ns_server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos, ubuntu

Triage: Untriaged
Is this a Regression?: Yes

 Description   
Found in build-3.0.0-973-release

Setup: Two clusters: src and dst with 3 nodes each. Please have 2 spare nodes
- Setup xdcr between src and dst cluster
- Ensure xdcr is setup and complete
- Hard Failover two nodes from dst cluster
- Verify nodes failover
- Add two spare nodes in dst cluster
- Initiate cbrecovery from src to dst
- stop cbrecovery using REST API
http://10.3.121.106:8091//pools/default/buckets/default/controller/stopRecovery?recovery_uuid=3ad71c7b3365593e0979da34306fb2a5

- initiate rebalance operation on dst cluster.

Observations: rebalance operation starts.
Expectation: Since the rebalance operation is disallowed from the UI when recovery is ongoing (or halted), the rebalance should not be allowed from the REST or cli interfaces either.
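
A minimal sketch of the REST calls involved, using the python requests package; the stopRecovery URL is the one from the description above, while the HTTP verbs, rebalance parameters and credentials are simplifying assumptions.

# Sketch of the reported scenario; not part of the test framework.
import requests

BASE = "http://10.3.121.106:8091"
AUTH = ("Administrator", "password")   # assumed credentials

# Stop the ongoing cbrecovery (URL taken from the description above).
requests.post(BASE + "/pools/default/buckets/default/controller/stopRecovery",
              params={"recovery_uuid": "3ad71c7b3365593e0979da34306fb2a5"},
              auth=AUTH)

# Start a rebalance directly over REST. Per this report the request is
# accepted, even though the UI keeps the Rebalance button disabled.
resp = requests.post(BASE + "/controller/rebalance",
                     data={"knownNodes": "ns_1@10.3.121.106", "ejectedNodes": ""},
                     auth=AUTH)
print(resp.status_code)   # 200 means the rebalance was started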


 Comments   
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
First of all you're doing it wrong here. Intended use of cbrecovery is to recover _source_ by using data from destination.
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
stop recovery is stop recovery. We do allow rebalance in this case by design.
Comment by Andrei Baranouski [ 24/Jul/14 ]
Alk, I do not agree regarding "Expectation: Since rebalance operation is disallowed from UI when recovery is ongoing (or halted). The rebalance should not be allowed from REST or cli interface"
I think we shouldn't have the possibility to trigger it via REST if we can't do it in the UI.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Steps don't match that "shouldn't". Feel free to file proper bug for "UI doesn't allow but REST does allow" with all proper details and evidence.




[MB-11805] KV+ XDCR System test: Missing items in bi-xdcr only Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-998

Clusters
-----------
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/
Free for investigation. Not attaching data files.

Steps
--------
1a. Load on both clusters till vb_active_resident_items_ratio < 50.
1b. Setup bi-xdcr on "standardbucket", uni-xdcr on "standardbucket1"
2. Access phase with 50% gets, 50%deletes for 3 hrs
3. Rebalance-out 1 node at cluster1
4. Rebalance-in 1 node at cluster1
5. Failover and remove node at cluster1
6. Failover and add-back node at cluster1
7. Rebalance-out 1 node at cluster2
8. Rebalance-in 1 node at cluster2
9. Failover and remove node at cluster2
10. Failover and add-back node at cluster2
11. Soft restart all nodes in cluster1 one by one
Verify item count

Problem
-------------
standardbucket(C1) <---> standardbucket(C2)
On C1 - 57890744 items
On C2 - 57957032 items
standardbucket1(C1) ----> standardbucket1(C2)
On C1 - 14053020 items
On C2 - 14053020 items

Total number of missing items : 66,288

Bucket priority
-----------------------
Both standardbucket and standardbucket1 have high priority.


Attached
-------------
cbcollect and list of keys that are missing on vb0


Missing keys
-------------------
At least 50-60 keys are missing in every vbucket. Attaching all missing keys from vb0.

vb0
-------
{'C1_node:': u'172.23.105.44',
'vb': 0,
'C2_node': u'172.23.105.54',
'C1_key_count': 78831,
 'C2_key_count': 78929,
 'missing_keys': 98}

     id: 06FA8A8B-11_110 deleted, tombstone exists
     id: 06FA8A8B-11_1354 present, report a bug!
     id: 06FA8A8B-11_1426 present, report a bug!
     id: 06FA8A8B-11_2175 present, report a bug!
     id: 06FA8A8B-11_2607 present, report a bug!
     id: 06FA8A8B-11_2797 present, report a bug!
     id: 06FA8A8B-11_3871 deleted, tombstone exists
     id: 06FA8A8B-11_4245 deleted, tombstone exists
     id: 06FA8A8B-11_4537 present, report a bug!
     id: 06FA8A8B-11_662 deleted, tombstone exists
     id: 06FA8A8B-11_6960 present, report a bug!
     id: 06FA8A8B-11_7064 present, report a bug!
     id: 3600C830-80_1298 present, report a bug!
     id: 3600C830-80_1308 present, report a bug!
     id: 3600C830-80_2129 present, report a bug!
     id: 3600C830-80_4219 deleted, tombstone exists
     id: 3600C830-80_4389 deleted, tombstone exists
     id: 3600C830-80_7038 present, report a bug!
     id: 3FEF1B93-91_2890 present, report a bug!
     id: 3FEF1B93-91_2900 present, report a bug!
     id: 3FEF1B93-91_3004 present, report a bug!
     id: 3FEF1B93-91_3194 present, report a bug!
     id: 3FEF1B93-91_3776 deleted, tombstone exists
     id: 3FEF1B93-91_753 present, report a bug!
     id: 52D6D916-120_1837 present, report a bug!
     id: 52D6D916-120_3282 present, report a bug!
     id: 52D6D916-120_3312 present, report a bug!
     id: 52D6D916-120_3460 present, report a bug!
     id: 52D6D916-120_376 deleted, tombstone exists
     id: 52D6D916-120_404 deleted, tombstone exists
     id: 52D6D916-120_4926 present, report a bug!
     id: 52D6D916-120_5022 present, report a bug!
     id: 52D6D916-120_5750 present, report a bug!
     id: 52D6D916-120_594 deleted, tombstone exists
     id: 52D6D916-120_6203 present, report a bug!
     id: 5C12B75A-142_2889 present, report a bug!
     id: 5C12B75A-142_2919 present, report a bug!
     id: 5C12B75A-142_569 deleted, tombstone exists
     id: 73C89FDB-102_1013 present, report a bug!
     id: 73C89FDB-102_1183 present, report a bug!
     id: 73C89FDB-102_1761 present, report a bug!
     id: 73C89FDB-102_2232 present, report a bug!
     id: 73C89FDB-102_2540 present, report a bug!
     id: 73C89FDB-102_4092 deleted, tombstone exists
     id: 73C89FDB-102_4102 deleted, tombstone exists
     id: 73C89FDB-102_668 deleted, tombstone exists
     id: 87B03DB1-62_3369 present, report a bug!
     id: 8DA39D2B-131_1949 present, report a bug!
     id: 8DA39D2B-131_725 deleted, tombstone exists
     id: A2CC835C-00_2926 present, report a bug!
     id: A2CC835C-00_3022 present, report a bug!
     id: A2CC835C-00_3750 present, report a bug!
     id: A2CC835C-00_5282 present, report a bug!
     id: A2CC835C-00_5312 present, report a bug!
     id: A2CC835C-00_5460 present, report a bug!
     id: A2CC835C-00_6133 present, report a bug!
     id: A2CC835C-00_6641 present, report a bug!
     id: A5C9F867-33_1091 present, report a bug!
     id: A5C9F867-33_1101 present, report a bug!
     id: A5C9F867-33_1673 present, report a bug!
     id: A5C9F867-33_2320 present, report a bug!
     id: A5C9F867-33_2452 present, report a bug!
     id: A5C9F867-33_4010 deleted, tombstone exists
     id: A5C9F867-33_4180 deleted, tombstone exists
     id: CD7B0436-153_3638 present, report a bug!
     id: CD7B0436-153_828 present, report a bug!
     id: D94DA3B2-51_829 present, report a bug!
     id: DE161E9D-40_1235 present, report a bug!
     id: DE161E9D-40_1547 present, report a bug!
     id: DE161E9D-40_2014 present, report a bug!
     id: DE161E9D-40_2184 present, report a bug!
     id: DE161E9D-40_2766 present, report a bug!
     id: DE161E9D-40_3880 deleted, tombstone exists
     id: DE161E9D-40_3910 deleted, tombstone exists
     id: DE161E9D-40_4324 deleted, tombstone exists
     id: DE161E9D-40_4456 deleted, tombstone exists
     id: DE161E9D-40_6801 present, report a bug!
     id: DE161E9D-40_6991 present, report a bug!
     id: DE161E9D-40_7095 present, report a bug!
     id: DE161E9D-40_7105 present, report a bug!
     id: DE161E9D-40_940 present, report a bug!
     id: E9F46ECC-22_173 deleted, tombstone exists
     id: E9F46ECC-22_2883 present, report a bug!
     id: E9F46ECC-22_2913 present, report a bug!
     id: E9F46ECC-22_3017 present, report a bug!
     id: E9F46ECC-22_3187 present, report a bug!
     id: E9F46ECC-22_3765 deleted, tombstone exists
     id: E9F46ECC-22_5327 present, report a bug!
     id: E9F46ECC-22_5455 present, report a bug!
     id: E9F46ECC-22_601 deleted, tombstone exists
     id: E9F46ECC-22_6096 present, report a bug!
     id: E9F46ECC-22_6106 present, report a bug!
     id: E9F46ECC-22_6674 present, report a bug!
     id: E9F46ECC-22_791 present, report a bug!
     id: ECD6BE16-113_2961 present, report a bug!
     id: ECD6BE16-113_3065 present, report a bug!
     id: ECD6BE16-113_3687 present, report a bug!
     id: ECD6BE16-113_3717 present, report a bug!

74 undeleted key(s) present on C2(.54) compared to C1(.44)
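
The per-vbucket report above comes from diffing the key sets on both clusters; below is a minimal sketch of that comparison, where get_keys() is a hypothetical stand-in for however the verification script dumps the keys of one vbucket.

# Illustrative sketch of the verification logic; get_keys() is hypothetical.
def get_keys(cluster, vb):
    # Stand-in for dumping all keys of one vbucket from a cluster;
    # stubbed with tiny sample data so the sketch runs.
    sample = {("C1", 0): ["k1", "k2"], ("C2", 0): ["k1", "k2", "k3"]}
    return sample[(cluster, vb)]

def compare_vbucket(vb):
    c1_keys = set(get_keys("C1", vb))
    c2_keys = set(get_keys("C2", vb))
    missing_on_c1 = sorted(c2_keys - c1_keys)
    report = {"vb": vb,
              "C1_key_count": len(c1_keys),
              "C2_key_count": len(c2_keys),
              "missing_keys": len(missing_on_c1)}
    # Each missing key is then checked on C1: if only a tombstone exists it
    # was deleted there; otherwise it is present on C2 but truly absent on C1.
    return report, missing_on_c1

print(compare_vbucket(0))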


 Comments   
Comment by Aruna Piravi [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11805/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11805/C2.tar
Comment by Aruna Piravi [ 25/Jul/14 ]
[7/23/14 1:40:12 PM] Aruna Piraviperumal: hi Mike, I see some backfill stmts like in MB-11725 but that doesn't lead to any missing items
[7/23/14 1:40:13 PM] Aruna Piraviperumal: 172.23.105.47
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
 
172.23.105.50


172.23.105.59


172.23.105.62


172.23.105.45
/opt/couchbase/var/lib/couchbase/logs/memcached.log.27.txt:Tue Jul 22 16:02:46.470085 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-2ad6ab49733cf45595de9ee568c05798 - (vb 421) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.48


172.23.105.52
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.44
[7/23/14 1:56:12 PM] Michael Wiederhold: Having one of those isn't necessarily bad. Let me take a quick look
[7/23/14 2:02:49 PM] Michael Wiederhold: Ok this is good. I'll debug it a little bit more. Also, I don't necessarily expect that data loss will always occur because it's possible that the items could have already been replicated.
[7/23/14 2:03:38 PM] Aruna Piraviperumal: ok
[7/23/14 2:03:50 PM] Aruna Piraviperumal: I'm noticing data loss on standard bucket though
[7/23/14 2:04:19 PM] Aruna Piraviperumal: but no such disk snapshot logs found for 'standardbucket'
Comment by Mike Wiederhold [ 25/Jul/14 ]
For vbucket 0 in the logs I see that on the source side we have high seqno 102957, but on the destination we only have up to seqno 97705 so it appears that some items were not sent to the remote side. I also see in the logs that xdcr did request those items as shown in the log messages below.

memcached<0.78.0>: Wed Jul 23 12:30:02.506513 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 95291 and end seqno 0
memcached<0.78.0>: Wed Jul 23 13:30:01.683760 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) stream created with start seqno 95291 and end seqno 102957
memcached<0.78.0>: Wed Jul 23 13:30:02.070134 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) Stream closing, 0 items sent from disk, 7666 items sent from memory, 102957 was last seqno sent
[ns_server:info,2014-07-23T13:30:10.753,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Wed Jul 23 13:30:10.552586 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 102957 and end seqno 0
Comment by Mike Wiederhold [ 25/Jul/14 ]
Alk,

See my comments above. Can you verify that all items were sent by the xdcr module correctly?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Let me quickly note that .tar is again in fact .tar.gz.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
missing:

a) data files (so that I can double-check your finding)

b) xdcr traces
Comment by Aruna Piravi [ 25/Jul/14 ]
1. For system tests, data files are huge, I did not attach them, the cluster is available.
2. xdcr traces were not enabled for this run, my apologies, but do we discard all the info we have in hand? Another complete run will take 3 days. I'm not sure if we want to delay the investigation for that long.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
There's no way to investigate such delicate issue without having at least traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
If all files are large you can at least attach that vbucket 0 where you found discrepancies.
Comment by Aruna Piravi [ 25/Jul/14 ]
> There's no way to investigate such delicate issue without having at least traces.
If it is that important, it probably makes sense to enable traces by default rather than having to do diag/eval. Customer logs are not going to have traces by default.

>If all files are large you can at least attach that vbucket 0 where you found discrepancies.
 I can, if requested. The cluster was anyway left available.

Fine, let me do another run if there's no way to work around not having traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
>> > There's no way to investigate such delicate issue without having at least traces.

>> If it is that important, it probably makes sense to enable traces by default than having to do diag/eval? Customer logs are not going to have traces by default.

Not possible. We log potentially critical information. But _your_ tests are all semi-automated right? So for your automation it makes sense indeed to always enable xdcr tracing.
Comment by Aruna Piravi [ 25/Jul/14 ]
System test is completely automated. Only the post-test verification is not. But enabling tracing is now a part of the framework.




[MB-11804] [Windows] Memcached error #132 'Internal error': Internal error for vbucket... when set key to bucket Created: 23/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: windows 2008 R2 64-bit

Attachments: Zip Archive 172.23.107.124-7232014-1631-diag.zip     Zip Archive 172.23.107.125-7232014-1633-diag.zip     Zip Archive 172.23.107.126-7232014-1634-diag.zip     Zip Archive 172.23.107.127-7232014-1635-diag.zip    
Triage: Untriaged
Operating System: Windows 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build from centos build. http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-999-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Test warmup test in build 3.0.0-999 on 4 nodes windows 2008 R2 64-bit
python testrunner.py -i ../../ini/4-w-sanity-new.ini -t warmupcluster.WarmUpClusterTest.test_warmUpCluster,num_of_docs=100

The test failed when it loaded keys to bucket default. This test passed in both centos 6.4 and ubuntu 12.04 64-bit





[MB-11803] {UPR}:: Rebalance-out failing due to bad replicators Created: 23/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 10.5.2.13
10.5.2.14
10.5.2.15
10.3.121.63
10.3.121.64
10.3.121.66
10.3.121.69

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Tested on 1011 and 1005; both ubuntu and centos are seeing this issue.

1. Create a 7 node cluster
2. Create a default bucket
3. Add 100 K items
4. Rebalance-out 1 Node (10.3.121.69)
5. Do Ops for Gets

Step 4 and Step 5 act in parallel.

Rebalance exits with the following error:

Bad replicators after rebalance:
Missing = [{'ns_1@10.3.121.63','ns_1@10.3.121.64',0},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',1},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',2},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',3},
{'ns_1@10.3.121.63','ns_1@10.3.121.64',56},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',4},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',5},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',6},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',57},
{'ns_1@10.3.121.63','ns_1@10.3.121.66',58},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',19},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',20},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',21},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',22},
{'ns_1@10.3.121.64','ns_1@10.3.121.63',59},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',23},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',24},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',25},
{'ns_1@10.3.121.64','ns_1@10.3.121.66',60},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',26},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',29},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',30},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',31},
{'ns_1@10.3.121.64','ns_1@10.5.2.13',61},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',38},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',39},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',40},
{'ns_1@10.3.121.66','ns_1@10.3.121.63',62},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',41},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',42},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',43},
{'ns_1@10.3.121.66','ns_1@10.3.121.64',63},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',44},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',47},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',48},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',49},
{'ns_1@10.3.121.66','ns_1@10.5.2.13',64},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',65},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',74},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',75},
{'ns_1@10.5.2.13','ns_1@10.3.121.63',76},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',66},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',77},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',78},
{'ns_1@10.5.2.13','ns_1@10.3.121.64',79},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',67},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',80},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',81},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',82},
{'ns_1@10.5.2.13','ns_1@10.3.121.66',83},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',68},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',92},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',93},
{'ns_1@10.5.2.14','ns_1@10.3.121.63',94},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',69},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',95},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',96},
{'ns_1@10.5.2.14','ns_1@10.3.121.64',97},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',71},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',110},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',111},
{'ns_1@10.5.2.15','ns_1@10.3.121.63',112},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',72},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',113},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',114},
{'ns_1@10.5.2.15','ns_1@10.3.121.64',115},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',73},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',116},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',117},
{'ns_1@10.5.2.15','ns_1@10.3.121.66',118}]
Extras = []

Test Case:: ./testrunner -i centos.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_get_random_key,nodes_out=1,items=100000,value_size=256,skip_cleanup=True,GROUP=OUT;BASIC;P0;FROM_2_0

Will attach logs asap

 Comments   
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
This is relatively easily reproducible on cluster_run. I'm seeing upr disconnects which explain bad_replicas.

Might be duplicate of another upr disconnects bug.
Comment by Parag Agarwal [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11803/logs.tar.gz
Comment by Mike Wiederhold [ 23/Jul/14 ]
http://review.couchbase.org/#/c/39760/
Comment by Parag Agarwal [ 23/Jul/14 ]
Does not repro in 1014




[MB-11802] [BUG BASH] Sample Bug Created: 23/Jul/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Don Pinto Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Sample test bug for bug bash - Ignore




[MB-11801] It takes almost 2x more time to rebalance 10 empty buckets Created: 23/Jul/14  Updated: 23/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Abhinav Dangeti
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-881

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: PNG File reb_empty.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/400/artifact/
Is this a Regression?: Yes

 Description   
Rebalance-in, 3 -> 4, 10 empty buckets

There was only one change:
http://review.couchbase.org/#/c/34501/




[MB-11800] cbworkloadgen failed to run in rhel 6.5 Created: 23/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.0.1, 2.5.1
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Cédric Delgehier Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Red Hat Enterprise Linux Server release 6.5 (Santiago)
kernel 2.6.32-431.20.3.el6.x86_64

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
After installing Couchbase,

I tried to run cbworkloadgen, but I got an error:

{noformat}
[root@rhel65_64~]# /opt/couchbase/lib/python/cbworkloadgen --version
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/couchstore.py", line 29, in <module>
    _lib = CDLL("libcouchstore-1.dll")
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory
[root@rhel65_64~]# /opt/couchbase/lib/python/cbworkloadgen -n localhost:8091
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/couchstore.py", line 29, in <module>
    _lib = CDLL("libcouchstore-1.dll")
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory
{noformat}

Versions tested:
couchbase-server-2.0.1-170.x86_64
couchbase-server-2.5.1-1083.x86_64

 Comments   
Comment by Bin Cui [ 23/Jul/14 ]
First, please check if libcouchstore.so is under /opt/couchbase/lib. If yes, please check if the following python script can run correctly

import traceback, sys   # needed by the else: branch below
import ctypes
for lib in ('libcouchstore.so', # Linux
            'libcouchstore.dylib', # Mac OS
            'couchstore.dll', # Windows
            'libcouchstore-1.dll'): # Windows (pre-CMake)
    try:
        _lib = ctypes.CDLL(lib)
        break
    except OSError, err:
        continue
else:
    traceback.print_exc()
    sys.exit(1)
Comment by Bin Cui [ 23/Jul/14 ]
The problem is possibly caused by wrong permissions for the ctypes module.

http://review.couchbase.org/#/c/39764/
Comment by Cédric Delgehier [ 23/Jul/14 ]
[root@rhel65_64 ~]# ls -al /opt/couchbase/lib/libcouchstore.so
lrwxrwxrwx 1 bin bin 22 Jul 22 14:51 /opt/couchbase/lib/libcouchstore.so -> libcouchstore.so.1.0.0

---

[root@rhel65_64 ~]# cat test.py
#!/usr/bin/env python
# -*-python-*-

import traceback, sys
import ctypes
for lib in ('libcouchstore.so', # Linux
            'libcouchstore.dylib', # Mac OS
            'couchstore.dll', # Windows
            'libcouchstore-1.dll'): # Windows (pre-CMake)
    try:
        _lib = ctypes.CDLL(lib)
        break
    except OSError, err:
        continue
else:
    traceback.print_exc()
    sys.exit(1)

[root@rhel65_64 ~]# python test.py
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    _lib = ctypes.CDLL(lib)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory

---

[root@rhel65_64 ~]# python -c "import sys; print sys.version_info[1]"
6

---

[root@rhel65_64~]# ls -ald /opt/couchbase/lib/python/pysqlite2
drwx---r-x 3 1001 1001 4096 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2

[root@rhel65_64~]# ls -al /opt/couchbase/lib/python/pysqlite2/*
-rw----r-- 1 1001 1001 2624 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/dbapi2.py
-rw------- 1 root root 2684 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2/dbapi2.pyc
-rw----r-- 1 1001 1001 2350 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/dump.py
-rw----r-- 1 1001 1001 1020 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/__init__.py
-rw------- 1 root root 134 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2/__init__.pyc
-rwx---r-- 1 1001 1001 1253220 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/_sqlite.so

/opt/couchbase/lib/python/pysqlite2/test:
total 120
drwx---r-- 3 1001 1001 4096 Jul 22 14:52 .
drwx---r-x 3 1001 1001 4096 Jul 23 11:02 ..
-rw----r-- 1 1001 1001 29886 Jul 22 14:52 dbapi.py
-rw----r-- 1 1001 1001 1753 Jul 22 14:52 dump.py
-rw----r-- 1 1001 1001 7942 Jul 22 14:52 factory.py
-rw----r-- 1 1001 1001 6569 Jul 22 14:52 hooks.py
-rw----r-- 1 1001 1001 1966 Jul 22 14:52 __init__.py
drwx---r-- 2 1001 1001 4096 Jul 22 14:52 py25
-rw----r-- 1 1001 1001 10443 Jul 22 14:52 regression.py
-rw----r-- 1 1001 1001 7356 Jul 22 14:52 transactions.py
-rw----r-- 1 1001 1001 15200 Jul 22 14:52 types.py
-rw----r-- 1 1001 1001 13217 Jul 22 14:52 userfunctions.py

---

[root@rhel65_64~]# ls -ald /opt/couchbase/lib/python/pysnappy2_24
ls: cannot access /opt/couchbase/lib/python/pysnappy2_24: No such file or directory
[root@rhel65_64~]# locate pysnappy
[root@rhel65_64~]#

---

As an indication, for version 4:

[root@rhel65_64~]# ls -al /usr/lib64/python2.6/lib-dynload/_ctypes.so
-rwxr-xr-x 1 root root 123608 Nov 21 2013 /usr/lib64/python2.6/lib-dynload/_ctypes.so
[root@rhel65_64~]# ls -ald /usr/lib64/python2.6/ctypes/
drwxr-xr-x. 3 root root 4096 Jul 9 19:52 /usr/lib64/python2.6/ctypes/
[root@rhel65_64~]# ls -ald /usr/lib64/python2.6/ctypes/*
-rw-r--r-- 1 root root 2041 Nov 22 2010 /usr/lib64/python2.6/ctypes/_endian.py
-rw-r--r-- 2 root root 2286 Nov 21 2013 /usr/lib64/python2.6/ctypes/_endian.pyc
-rw-r--r-- 2 root root 2286 Nov 21 2013 /usr/lib64/python2.6/ctypes/_endian.pyo
-rw-r--r-- 1 root root 17004 Nov 22 2010 /usr/lib64/python2.6/ctypes/__init__.py
-rw-r--r-- 2 root root 19936 Nov 21 2013 /usr/lib64/python2.6/ctypes/__init__.pyc
-rw-r--r-- 2 root root 19936 Nov 21 2013 /usr/lib64/python2.6/ctypes/__init__.pyo
drwxr-xr-x. 2 root root 4096 Jul 9 19:52 /usr/lib64/python2.6/ctypes/macholib
-rw-r--r-- 1 root root 8531 Nov 22 2010 /usr/lib64/python2.6/ctypes/util.py
-rw-r--r-- 1 root root 8376 Mar 20 2010 /usr/lib64/python2.6/ctypes/util.py.binutils-no-dep
-rw-r--r-- 2 root root 7493 Nov 21 2013 /usr/lib64/python2.6/ctypes/util.pyc
-rw-r--r-- 2 root root 7493 Nov 21 2013 /usr/lib64/python2.6/ctypes/util.pyo
-rw-r--r-- 1 root root 5349 Nov 22 2010 /usr/lib64/python2.6/ctypes/wintypes.py
-rw-r--r-- 2 root root 5959 Nov 21 2013 /usr/lib64/python2.6/ctypes/wintypes.pyc
-rw-r--r-- 2 root root 5959 Nov 21 2013 /usr/lib64/python2.6/ctypes/wintypes.pyo



Comment by Bin Cui [ 24/Jul/14 ]
Check whether we support RHEL 6.5 or not.
Comment by Cédric Delgehier [ 24/Jul/14 ]
http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#supported-platforms
Comment by Cédric Delgehier [ 25/Jul/14 ]
So if I understand the implication correctly, you are telling me to roll back the security patches to version 6.3, is that it?
Comment by Anil Kumar [ 28/Jul/14 ]
Cédric - I can appreciate your concern. We haven't tested our software on RHEL 6.5 yet; it's on our roadmap, but not for the current release. If this issue happens only on RHEL 6.5 and not on the (supported) RHEL 6.3, then we won't be able to provide a fix at this time. I will keep this issue in our bug-backlog for a future release.




[MB-11799] Bucket compaction causes massive slowness of flusher and UPR consumers Created: 23/Jul/14  Updated: 28/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_b1-vs-compaction_b2-vs-ep_upr_replica_items_remaining-vs_xdcr_lag.png    
Issue Links:
Duplicate
is duplicated by MB-11731 Persistence to disk suffers from buck... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/386/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Similar to MB-11731, which keeps getting worse. But now compaction affects intra-cluster replication and XDCR latency as well:

"ep_upr_replica_items_remaining" reaches 1M during compaction
"xdcr latency" reaches 5 minutes during compaction.

See attached charts for details. Full reports:

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1005_a66_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1005_6d2_access

One important change that we made recently - http://review.couchbase.org/#/c/39647/.

The last known working build is 3.0.0-988.

 Comments   
Comment by Pavel Paulau [ 23/Jul/14 ]
Chiyoung,

This is a really critical regression. It affects many XDCR tests and also blocks many investigation/tuning efforts.
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix added for review at http://review.couchbase.org/39880. Thanks.
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue:

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.
Comment by Pavel Paulau [ 26/Jul/14 ]
Toy build helps a lot.

It doesn't fix the problem but at least minimizes the regression:
-- ep_upr_replica_items_remaining is close to zero now
-- write queue is 10x lower
-- max xdcr latency is about 8-9 seconds

Logs: http://ci.sc.couchbase.com/view/lab/job/perf-dev/530/
Reports:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-785-toy_6ed_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-785-toy_269_access
Comment by Chiyoung Seo [ 26/Jul/14 ]
Thanks Pavel for the updates. We will merge the above changes soon.

Do you mean that both the disk write queue size and XDCR latency are still regressions, or is XDCR your only major concern?

As you pointed out above, the recent change that parallelizes compaction (4 tasks by default) is most likely the main root cause of this issue. Do you still see the compaction slowness in your tests? I guess "no", because we can now run 4 concurrent compaction tasks on each node.

I will talk to Aliaksey to understand that change more.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Pavel,

I will continue to look at some more optimizations on the ep-engine side. In the meantime, you may want to test the toy build again by lowering compaction_number_of_kv_workers on the ns_server side from 4 to 1. As mentioned in http://review.couchbase.org/#/c/39647/ , that parameter is configurable on the ns_server side.
Comment by Chiyoung Seo [ 26/Jul/14 ]
Btw, all the changes above were merged. You can use the new build and lower the above compaction parameter.
Comment by Pavel Paulau [ 28/Jul/14 ]
Build 3.0.0-1035 with compaction_number_of_kv_workers = 1:

http://ci.sc.couchbase.com/job/perf-dev/533/artifact/

Source: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1035_276_access
Destination: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1035_624_access

Disk write queue is lower (max ~5-10K) but xdcr latency is still high (several seconds) and affected by compaction.




[MB-11797] Rebalance-out hangs during Rebalance + Views operation in DGM run Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel

Attachments: Text File logs.txt    
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Link:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_02--view_query_extended-P1/145/consoleFull

Test to Reproduce:
./testrunner -i /tmp/ubuntu12-view6node.ini get-delays=True,get-cbcollect-info=True -t view.createdeleteview.CreateDeleteViewTests.incremental_rebalance_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction

Steps to Reproduce:
1. Setup 5-node cluster
2. Create default bucket
3. Load 200000 items
4. Load bucket to achieve dgm 10%
5. Create Views
6. Start ddoc + Rebalance out operations in parallel

Please refer attached log file "logs.txt".

Uploading Logs:


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/8586d8eb/172.23.106.201-7222014-2350-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/ea5d5a3f/172.23.106.199-7222014-2354-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d06d7861/172.23.106.200-7222014-2355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/65653f65/172.23.106.198-7222014-2353-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/dd05a054/172.23.106.197-7222014-2352-diag.zip
Comment by Sriram Melkote [ 23/Jul/14 ]
Nimish - to my eyes, it looks like views are not involved in this failure. Can you please take a look at the detailed log and assign to Alk if you agree? Thanks
Comment by Nimish Gupta [ 23/Jul/14 ]
From the logs:

[couchdb:info,2014-07-22T14:47:21.345,ns_1@172.23.106.199:<0.17993.2>:couch_log:info:39]Set view `default`, replica (prod) group `_design/dev_ddoc40`, signature `c018b62ae9eab43522a3d0c43ac48b3e`, terminating with reason: {upr_died,
                                                                                                                                       {bad_return_value,
                                                                                                                                        {stop,
                                                                                                                                         sasl_auth_failed}}}

One obvious problem is that we returned the wrong number of parameters for stop when SASL auth failed. I have fixed that, and it is under review (http://review.couchbase.org/#/c/39735/).

I don't know why SASL auth failed; it may be normal for SASL auth to fail during rebalance. Meenakshi, could you please run the test again after this change is merged?
Comment by Nimish Gupta [ 23/Jul/14 ]
Trond has added code to log more information for SASL errors in memcached (http://review.couchbase.org/#/c/39738/). It will be helpful in debugging SASL errors.
Comment by Meenakshi Goel [ 24/Jul/14 ]
Issue is reproducible with latest build 3.0.0-1020-rel.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/99/consoleFull
Uploading Logs shortly.
Comment by Meenakshi Goel [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/13f68e9c/172.23.106.186-7242014-1238-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/c0cf8496/172.23.106.187-7242014-1239-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/77b2fb50/172.23.106.188-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d0335545/172.23.106.189-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/7634b520/172.23.106.190-7242014-1241-diag.zip
Comment by Nimish Gupta [ 24/Jul/14 ]
From the ns_server logs, it looks to me like memcached has crashed.

[error_logger:error,2014-07-24T12:28:36.305,ns_1@172.23.106.186:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_memcached:init/1
    pid: <0.693.0>
    registered_name: []
    exception exit: {badmatch,{error,closed}}
      in function gen_server:init_it/6 (gen_server.erl, line 328)
    ancestors: ['single_bucket_sup-default',<0.675.0>]
    messages: []
    links: [<0.717.0>,<0.719.0>,<0.720.0>,<0.277.0>,<0.676.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 75113
    stack_size: 27
    reductions: 26397931
  neighbours:

Ep-engine/ns_server team, please take a look.
Comment by Nimish Gupta [ 24/Jul/14 ]
From the logs:

** Reason for termination ==
** {unexpected_exit,
       {'EXIT',<0.31044.9>,
           {{{badmatch,{error,closed}},
             {gen_server,call,
                 ['ns_memcached-default',
                  {get_dcp_docs_estimate,321,
                      "replication:ns_1@172.23.106.187->ns_1@172.23.106.188:default"},
                  180000]}},
            {gen_server,call,
                [{'janitor_agent-default','ns_1@172.23.106.187'},
                 {if_rebalance,<0.15733.9>,
                     {wait_dcp_data_move,['ns_1@172.23.106.188'],321}},
                 infinity]}}}}
Comment by Sriram Melkote [ 25/Jul/14 ]
Alk, can you please take a look? Thanks!
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Quick hint for fellow coworkers: when you see a connection closed, usually the first thing to check is whether memcached has crashed. And in this case it indeed has (the diag's cluster-wide logs are the perfect place to find these issues):

2014-07-24 12:28:35.861 ns_log:0:info:message(ns_1@172.23.106.186) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:09:47.941525 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 650) stream created with start seqno 5794 and end seqno 18446744073709551615
Thu Jul 24 12:09:49.115570 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 749, cookie 0x606f800
Thu Jul 24 12:09:49.380310 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 648, cookie 0x6070d00
Thu Jul 24 12:09:49.450869 PDT 3: (default) UPR (Consumer) eq_uprq:replication:ns_1@172.23.106.189->ns_1@172.23.106.186:default - (vb 648) Attempting to add takeover stream with start seqno 5463, end seqno 18446744073709551615, vbucket uuid 35529072769610, snap start seqno 5463, and snap end seqno 5463
Thu Jul 24 12:09:49.495674 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 648) stream created with start seqno 5463 and end seqno 18446744073709551615
2014-07-24 12:28:36.302 ns_memcached:0:info:message(ns_1@172.23.106.186) - Control connection to memcached on 'ns_1@172.23.106.186' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_memcached:0:info:message(ns_1@172.23.106.187) - Control connection to memcached on 'ns_1@172.23.106.187' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_log:0:info:message(ns_1@172.23.106.187) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:28:35.860224 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1019) Stream closing, 0 items sent from disk, 0 items sent from memory, 5781 was last seqno sent
Thu Jul 24 12:28:35.860235 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1020) Stream closing, 0 items sent from disk, 0 items sent from memory, 5879 was last seqno sent
Thu Jul 24 12:28:35.860246 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1021) Stream closing, 0 items sent from disk, 0 items sent from memory, 5772 was last seqno sent
Thu Jul 24 12:28:35.860256 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1022) Stream closing, 0 items sent from disk, 0 items sent from memory, 5427 was last seqno sent
Thu Jul 24 12:28:35.860266 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1023) Stream closing, 0 items sent from disk, 0 items sent from memory, 5480 was last seqno sent

Status 137 is 128 (death by signal, set by the kernel) + 9, so signal 9 (SIGKILL). dmesg (captured in couchbase.log) shows no signs of OOM. This means humans :) Not the first and sadly not the last time something like this happens: rogue scripts, bad tests, etc.
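
For reference, a minimal sketch of that signal arithmetic (not from the original comment):

# Minimal sketch: decode a port server exit status as ns_server reports it.
# Statuses above 128 mean the process died from signal (status - 128).
import signal

status = 137  # value from the log above
if status > 128:
    print("killed by signal %d (signal.SIGKILL == %d)" % (status - 128, signal.SIGKILL))
else:
    print("exited normally with code %d" % status)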
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Also, we should stop the practice of reusing tickets for unrelated conditions. This doesn't look anywhere close to a rebalance hang, does it?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Not sure what to do about this one. Closing as incomplete will probably not hurt.




[MB-11796] Rebalance after manual failover hangs (delta recovery) Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb11.log     Text File gdb12.log     Text File gdb13.log     Text File gdb14.log     Text File master_events.log    
Issue Links:
Duplicate
duplicates MB-11768 movement of 27 empty replica vbuckets... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.11.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.12.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.13.zip
https://s3.amazonaws.com/bugdb/jira/MB-11796/172.23.96.14.zip
Is this a Regression?: Yes

 Description   
1 of 4 nodes is being re-added after failover.
100M x 2KB items, 10K mixed ops/sec.

Steps:
1. Failover one of nodes.
2. Add it back.
3. Enabled delta recovery.
4. Sleep 20 minutes.
5. Rebalance cluster.

Warmup is completed but rebalance hangs afterwards.

 Comments   
Comment by Sriram Ganesan [ 23/Jul/14 ]
I see the following log messages

Tue Jul 22 23:16:44.367356 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.11->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds
Tue Jul 22 23:16:44.367363 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.14->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds
Tue Jul 22 23:16:44.367376 PDT 3: (bucket-1) UPR (Consumer) eq_uprq:replication:ns_1@172.23.96.13->ns_1@172.23.96.12:bucket-1 - Disconnecting because noop message has no been received for 40 seconds

I also see messages like this

Wed Jul 23 02:30:49.306705 PDT 3: 155 Closing connection due to read error: Connection reset by peer
Wed Jul 23 02:30:49.310060 PDT 3: 144 Closing connection due to read error: Connection reset by peer
Wed Jul 23 02:30:49.310273 PDT 3: 152 Closing connection due to read error: Connection reset by peer

The first set of messages could point to a bug in UPR causing the disconnections, and the second set could be because we are trying to read from a disconnected socket. Interestingly, a fix for bug MB-11803 (http://review.couchbase.org/#/c/39760/) was recently merged in the UPR noop area. It might be a good idea to run this test with that fix to see if it addresses the problem.

I don't see any of the above error messages in the logs of MB-11768. So, the seqnoWaitingStarted in this case could be different from the one in MB-11768 assuming that the fix for MB-11803 solves this problem.

Comment by Pavel Paulau [ 24/Jul/14 ]
Indeed, that fix helped.




[MB-11795] Rebalance exited with reason {unexpected_exit, {'EXIT',<0.27836.0>,{bulk_set_vbucket_state_failed...} Created: 23/Jul/14  Updated: 24/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Meenakshi Goel Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1005-rel

Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.sc.couchbase.com/job/centos_x64--29_01--create_view_all-P1/126/consoleFull

Test to Reproduce:
./testrunner -i myfile.ini get-cbcollect-info=True,get-logs=True, -t view.createdeleteview.CreateDeleteViewTests.rebalance_in_and_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=3,num_views_per_ddoc=2,items=200000,sasl_buckets=1

Steps to Reproduce:
1. Setup a 4-node cluster
2. Create 1 default and 1 sasl bucket
3. Rebalance in 2 nodes
4. Start Rebalance

Logs:

[user:info,2014-07-23T2:25:43.220,ns_1@172.23.107.24:<0.1154.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {unexpected_exit,
                              {'EXIT',<0.27836.0>,
                               {bulk_set_vbucket_state_failed,
                                [{'ns_1@172.23.107.24',
                                  {'EXIT',
                                   {{{{{badmatch,
                                        [{<0.27848.0>,
                                          {done,exit,
                                           {normal,
                                            {gen_server,call,
                                             [<0.14598.0>,
                                              {setup_streams,
                                               [684,692,695,699,704,705,706,
                                                707,708,709,710,711,712,713,
                                                714,715,716,717,718,719,720,
                                                721,722,723,724,725,726,727,
                                                728,729,730,731,732,733,734,
                                                735,736,737,738,739,740,741,
                                                742,743,744,745,746,747,748,
                                                749,750,751,752,753,754,755,
                                                756,757,758,759,760,761,762,
                                                763,764,765,766,767,768,769,
                                                770,771,772,773,774,775,776,
                                                777,778,779,780,781,782,783,
                                                784,785,786,787,788,789,790,
                                                791,792,793,794,795,796,797,
                                                798,799,800,801,802,803,804,
                                                805,806,807,808,809,810,811,
                                                812,813,814,815,816,817,818,
                                                819,820,821,822,823,824,825,
                                                826,827,828,829,830,831,832,
                                                833,834,835,836,837,838,839,
                                                840,841,842,843,844,845,846,
                                                847,848,849,850,851,852,853]},
                                              infinity]}},
                                           [{gen_server,call,3,
                                             [{file,"gen_server.erl"},
                                              {line,188}]},
                                            {upr_replicator,
                                             '-spawn_and_wait/1-fun-0-',1,
                                             [{file,"src/upr_replicator.erl"},
                                              {line,195}]}]}}]},
                                       [{misc,
                                         sync_shutdown_many_i_am_trapping_exits,
                                         1,
                                         [{file,"src/misc.erl"},{line,1429}]},
                                        {upr_replicator,spawn_and_wait,1,
                                         [{file,"src/upr_replicator.erl"},
                                          {line,217}]},
                                        {upr_replicator,handle_call,3,
                                         [{file,"src/upr_replicator.erl"},
                                          {line,112}]},
                                        {gen_server,handle_msg,5,
                                         [{file,"gen_server.erl"},{line,585}]},
                                        {proc_lib,init_p_do_apply,3,
                                         [{file,"proc_lib.erl"},{line,239}]}]},
                                      {gen_server,call,
                                       ['upr_replicator-bucket0-ns_1@172.23.107.26',
                                        {setup_replication,
                                         [684,692,695,699,704,705,706,707,708,
                                          709,710,711,712,713,714,715,716,717,
                                          718,719,720,721,722,723,724,725,726,
                                          727,728,729,730,731,732,733,734,735,
                                          736,737,738,739,740,741,742,743,744,
                                          745,746,747,748,749,750,751,752,753,
                                          754,755,756,757,758,759,760,761,762,
                                          763,764,765,766,767,768,769,770,771,
                                          772,773,774,775,776,777,778,779,780,
                                          781,782,783,784,785,786,787,788,789,
                                          790,791,792,793,794,795,796,797,798,
                                          799,800,801,802,803,804,805,806,807,
                                          808,809,810,811,812,813,814,815,816,
                                          817,818,819,820,821,822,823,824,825,
                                          826,827,828,829,830,831,832,833,834,
                                          835,836,837,838,839,840,841,842,843,
                                          844,845,846,847,848,849,850,851,852,
                                          853]},
                                        infinity]}},
                                     {gen_server,call,
                                      ['replication_manager-bucket0',
                                       {change_vbucket_replication,684,
                                        'ns_1@172.23.107.26'},
                                       infinity]}},
                                    {gen_server,call,
                                     [{'janitor_agent-bucket0',
                                       'ns_1@172.23.107.24'},
                                      {if_rebalance,<0.1353.0>,
                                       {update_vbucket_state,684,replica,
                                        undefined,'ns_1@172.23.107.26'}},
                                      infinity]}}}}]}}}

Uploading Logs


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11795/f9ad56ee/172.23.107.24-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/07e24114/172.23.107.25-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/a9c9a36d/172.23.107.26-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11795/2517f70b/172.23.107.27-diag.zip
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
Seeing downstream (upr replicator to upr consumer) connection being closed.

Possibly due to this message

Wed Jul 23 02:25:43.080600 PDT 3: (bucket0) UPR (Consumer) eq_uprq:replication:ns_1@172.23.107.26->ns_1@172.23.107.24:bucket0 - (vb 684) Attempting to add stream with start seqno 0, end seqno 18446744073709551615, vbucket uuid 139895607874175, snap start seqno 0, and snap end seqno 0
Wed Jul 23 02:25:43.080642 PDT 3: (bucket0) UPR (Consumer) eq_uprq:replication:ns_1@172.23.107.26->ns_1@172.23.107.24:bucket0 - Disconnecting because noop message has no been received for 40 seconds
Wed Jul 23 02:25:43.082958 PDT 3: (bucket0) UPR (Producer) eq_uprq:replication:ns_1@172.23.107.24->ns_1@172.23.107.25:bucket0 - (vb 359) Stream closing, 0 items sent from disk, 0 items sent from memory, 0 was last seqno sent

This is on .24.

Appears related to yesterday's fix to detect disconnects on consumer side.
Comment by Aleksey Kondratenko [ 23/Jul/14 ]
CC-ed Chiyoung and optimistically passed this to Mike due to its apparent relation to the fix made (AFAIK) by Mike.
Comment by Mike Wiederhold [ 23/Jul/14 ]
Duplicate of MB-18003
Comment by Ketaki Gangal [ 24/Jul/14 ]
MB-11803* ?
Comment by Chiyoung Seo [ 24/Jul/14 ]
Ketaki,

Yes, it's MB-11803.




[MB-11794] Creating 10 buckets causes memcached segmentation fault Created: 23/Jul/14  Updated: 26/Jul/14  Resolved: 24/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Pavel Paulau Assignee: Pavel Paulau
Resolution: Fixed Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-998

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

Attachments: Text File gdb.log    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/ares/396/artifact/
Is this a Regression?: Yes

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Sundar,

The backtrace indicates that it is mostly a regression from the vbucket-level lock change for flusher, vb snapshot, compaction, and vbucket deletion task, which we made recently.
Comment by Pavel Paulau [ 23/Jul/14 ]
The same issue happened with a single bucket. The problem seems rather common.
Comment by Sundar Sridharan [ 24/Jul/14 ]
Found the root cause: cachedVBStates is not preallocated and is modified in a thread-unsafe manner. This regression shows up now because we have more parallelism with vbucket-level locking. Working on the fix.
Comment by Sundar Sridharan [ 24/Jul/14 ]
Fix uploaded for review at http://review.couchbase.org/#/c/39834/. Thanks.
Comment by Chiyoung Seo [ 24/Jul/14 ]
The fix was merged.




[MB-11793] Build breakage in upr-consumer.cc Created: 22/Jul/14  Updated: 23/Jul/14  Due: 23/Jul/14  Resolved: 23/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: .master
Fix Version/s: .master
Security Level: Public

Type: Task Priority: Test Blocker
Reporter: Chris Hillery Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: ep-engine
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Commit 8d636bbb02b0338df9e73c2573422b6463feb92d to ep-engine appears to be breaking the build on most platforms, e.g.:

http://builds.hq.northscale.net:8010/builders/centos-6-x64-master-builder/builds/890/steps/couchbase-server%20make%20enterprise%20/logs/stdio

 Comments   
Comment by Mike Wiederhold [ 23/Jul/14 ]
Just want to note here that this does not affect 3.0 builds in case anyone is looking at the ticket. The merge of the memcached 3.0 branch is linked below. Since I don't think anyone is working on the master branch I'm going to wait for someone to review the change.

http://review.couchbase.org/#/c/39708/




[MB-11792] Link in readme file in 3.0 does not work correctly Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Thuan Nguyen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Links still show 2.5 doc in browser

http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-network-ports.html
http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-bestpractice.html
http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-redhat.html
 http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-ubuntu.html

 Comments   
Comment by Thuan Nguyen [ 22/Jul/14 ]
Tested on build 3.0.0-973
Comment by Amy Kurtzman [ 22/Jul/14 ]
Those links are not correct for the documentation. Also, the 3.0 beta documentation hasn't been published yet, so they wouldn't work even if they were correct.

Docs are located at http://docs.couchbase.com/
(they are not at www.couchbase.com)




[MB-11791] README in ubuntu 12.04 image shows incorrect information Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: doc-system
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Attachments: Text File README_Linux_3.0_beta.txt     Text File README_Mac_3.0_beta.txt     Text File README_Windows_3.0_beta.txt    
Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
The README.txt in Couchbase Server version 3.0.0-973 for Ubuntu 12.04 shows incorrect information:
Couchbase Server for Ubuntu 12.04 does not depend on libssl0.9.8 as shown in README.txt.

root@ubuntu:~# ls /tmp/
couchbase-server-enterprise_ubuntu_1204_x86_64_3.0.0-973-rel.deb ssh-ZnnQQUv795 vmware-root
root@ubuntu:~# more /opt/couchbase/README.txt
Couchbase Server 3.0.0, Ubuntu and Centos

Couchbase Server is a distributed NoSQL document database for interactive applications. Its scale-out architecture runs in the cloud or on commodity hardware and provides a flexible data model, consistent high-performance, easy scalability and always-on 24x365 availability. This release contains fixes as well as new features and functionality, including:

- Multiple Readers and Writers threads for more rapid persistence onto disk
-'Optimistic Replication' to improve latency when you replicate documents via XDCR
- More XDCR Statistics to monitor performance and behavior of XDCR
- Detailed Rebalance Report to show actual number of buckets and keys that have been transferred to other nodes in a cluster
- Transfer, Backup and Restore can be done for design documents only. You do not need to include data. The default behavior is to transfer both data and design documents.
- Hostname Management provided as easy to use interfaces in Web Console and Installation Wizard
- Command Line tools updated so you can manage nodes, buckets, clusters and XDCR
- Upload CSV files into Couchbase with cbtransfer

For more information, see our Release Notes: http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-server-rn.html

REQUIREMENTS

- For Ubuntu platforms you will need a OpenSSL dependency or your server will not run. Do the following:

    root-shell> apt-get install libssl0.9.8

    OpenSSL is already included with Centos

- To run cbcollect_info you must have administrative privileges

INSTALL

Centos: http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-redhat.html

Ubuntu: http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-getting-started-install-ubuntu.html

By default we install Couchbase Server at /opt/couchbase

The server will automatically start after install and will be available by default on port 8091

For a full list of network ports for Couchbase Server, see http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-network-ports.html

To read more about Couchbase Server best practices, see http://www.couchbase.com/docs/couchbase-manual-3.0.0/couchbase-bestpractice.html



 Comments   
Comment by Ruth Harris [ 22/Jul/14 ]
Some of this content looks really really old. I'm attaching READMEs for all 3 operating systems.

Comment by Ruth Harris [ 22/Jul/14 ]
All 3 operating systems.




[MB-11790] couchbase-cli help does not show https in uploadHost in cluster-wide collectinfo (only https protocol supported) Created: 22/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Tested build 3.0.0-1001

Start cluster-wide log collection for whole cluster
    couchbase-cli collect-logs-start -c 192.168.0.1:8091 \
        -u Administrator -p password \
        --all-nodes --upload --upload-host=host.upload.com \
        --customer="example inc" --ticket=12345

 

 Comments   
Comment by Bin Cui [ 28/Jul/14 ]
The https protocol is an implementation detail. As a parameter, only the hostname is required, so the help text is accurate.




[MB-11789] couchbase-cli help should give an example of how to collect some nodes in cluster-wide collectinfo Created: 22/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
couchbase-cli help does not show how to run cluster-wide collectinfo on only some nodes, rather than all nodes

Start cluster-wide log collection for whole cluster
    couchbase-cli collect-logs-start -c 192.168.0.1:8091 \
        -u Administrator -p password \
        --all-nodes --upload --upload-host=host.upload.com \
        --customer="example inc" --ticket=12345

  Stop cluster-wide log collection
    couchbase-cli collect-logs-stop -c 192.168.0.1:8091 \
        -u Administrator -p password

  Show status of cluster-wide log collection
    couchbase-cli collect-logs-status -c 192.168.0.1:8091 \
        -u Administrator -p password

 Comments   
Comment by Bin Cui [ 28/Jul/14 ]
http://review.couchbase.org/#/c/39951/
Comment by Thuan Nguyen [ 28/Jul/14 ]
Need to add https, as mentioned in the comment on the fix
Comment by Bin Cui [ 28/Jul/14 ]
We should not specify the protocol in the upload-host parameter; just the hostname or IP address is good enough. https will be used and added by the server when it uses the parameter to upload files.




[MB-11788] [ui] getting incorrect ejection policy update warning when simply updating bucket's quota Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
SUBJ.

1. Create bucket aaa

2. Open bucket aaa Edit dialog

3. Change quota and hit enter

4. Observe a modal popup warning that should not be there, i.e. we're not updating any property that would require a bucket restart, but we're still getting the warning.





[MB-11787] couchbase-cli should validate host before running cluster-wide collection Created: 22/Jul/14  Updated: 28/Jul/14  Resolved: 28/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase 3.0.0-999 on one ubuntu 12.04 node
Do cluster-wide collectinfo with upload option.

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --allnodes --upload --upload-host=http://abcnn.com --customer=1234 --ticket=

couchbase-cli collect-logs-start did not validate that the upload host was a valid (https) host before running collectinfo

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-status -c 127.0.0.1:8091 -u Administrator -p password
Status: running
Details:
Node: ns_1@127.0.0.1
Status: started
path : /opt/couchbase/var/lib/couchbase/tmp/collectinfo-2014-07-22T213346-ns_1@127.0.0.1.zip



 Comments   
Comment by Bin Cui [ 28/Jul/14 ]
1. The upload-host parameter value is not quite right. You should not put http:// as part of the value; the server will add the right protocol as needed.
2. The server does check reachability. If an invalid hostname is provided, the following error is returned:

Failed to check reachability of https://hostname/<username>
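
For illustration (not part of the original comment), a corrected invocation would pass only the bare hostname; the flags follow the collect-logs-start example shown in the MB-11789/MB-11790 help text, and the host name here is the hypothetical one from the description above:

    couchbase-cli collect-logs-start -c 127.0.0.1:8091 \
        -u Administrator -p password \
        --all-nodes --upload --upload-host=abcnn.com \
        --customer=1234 --ticket=12345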




[MB-11786] {UPR}:: Rebalance-out hangs due to indexing stuck Created: 22/Jul/14  Updated: 26/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Parag Agarwal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Seeing this issue in 991

1. Create 7 node cluster (10.6.2.144-150)
2. Create default Bucket
3. Add 1K items
4. Create 5 views and query
5. Rebalance out node 10.6.2.150

Step 4 and 5 are run in parallel

We see the rebalance hanging

I am seeing the following issue in the couchdb log on 10.6.2.150

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}

Stacktrace: [{couch_index_merger,query_index,3,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_index_merger/src/couch_index_merger.erl"},
                  {line,75}]},
             {couch_httpd,handle_request,6,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couchdb/couch_httpd.erl"},
                  {line,222}]},
             {mochiweb_http,headers,5,


Will attach logs ASAP

Test Case:: ./testrunner -i ubuntu_x64--109_00--Rebalance-Out.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_with_queries,nodes_out=1,blob_generator=False,value_size=1024,GROUP=OUT;BASIC;P0;FROM_2_0

 Comments   
Comment by Parag Agarwal [ 22/Jul/14 ]
The cluster is live if you want to investigate 10.6.2.144-150.
Comment by Parag Agarwal [ 22/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11786/991_logs.tar.gz
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
We're waiting for the index to become updated.

I.e. I see a number of these:

     {<17674.13818.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007f64917effa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f6493d4f070 Return addr 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.9.246202>">>,<<"y(1) infinity">>,
                   <<"y(2) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.11899.5>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007f6493d4f0a8 Return addr 0x00007f6444879940 (janitor_agent:wait_index_updated/5 + 432)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(2) {'janitor_agent-default','ns_1@10.6.2.144'}">>,
                   <<"y(3) Catch 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d0 Return addr 0x00007f6444a49ea8 (ns_single_vbucket_mover:'-wait_index_updated/5-fun-0-'/5 + 104)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d8 Return addr 0x00007f64917f38a0 (proc_lib:init_p/3 + 688)">>,
                   <<>>,
                   <<"0x00007f6493d4f0e0 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007f64917f38c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,2}]},
       {heap_size,610},
       {total_heap_size,1597},
       {links,[<17674.13242.5>]},
       {memory,13688},
       {message_queue_len,0},
       {reductions,806},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And this:
     {<0.13891.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f64448ad040 (capi_set_view_manager:'-do_wait_index_updated/4-lc$^0/1-0-'/3 + 64)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f643e3ac948 Return addr 0x00007f64448abb90 (capi_set_view_manager:do_wait_index_updated/4 + 848)">>,
                   <<"y(0) #Ref<0.0.9.246814>">>,
                   <<"y(1) #Ref<0.0.9.246821>">>,
                   <<"y(2) #Ref<0.0.9.246820>">>,<<"y(3) []">>,<<>>,
                   <<"0x00007f643e3ac970 Return addr 0x00007f64917f3ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) {<0.13890.5>,#Ref<0.0.9.246813>}">>,<<>>,
                   <<"0x00007f643e3ac980 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f64917f3ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,5}]},
       {heap_size,987},
       {total_heap_size,1974},
       {links,[]},
       {memory,16808},
       {message_queue_len,0},
       {reductions,1425},
       {trap_exit,false}]}
Comment by Parag Agarwal [ 22/Jul/14 ]
Still seeing the issue in 3.0.0-1000, centos 6x, ubuntu 1204
Comment by Sriram Melkote [ 22/Jul/14 ]
Sarath, can you please take a look?
Comment by Nimish Gupta [ 22/Jul/14 ]
The error in the HTTP query will not hang the rebalance; the HTTP query error is happening because the ddoc was updated.
I see there is an error in getting mutations for partition 127 from ep-engine:

[couchdb:info,2014-07-22T13:37:59.764,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.866,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.967,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.070,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.171,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.272,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.373,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.474,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.575,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.676,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.777,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.878,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.979,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...

There are a lot of these messages, repeating continuously until the logs were collected.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Yes, ep-engine kept on returning ETMPFAIL for partition 127's stream request. Hence, indexing never progressed.
EP-Engine team should take a look.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Tue Jul 22 13:52:14.041453 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state
Tue Jul 22 13:52:14.143551 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state

It seems that vbucket 127 is in backfill state and the backfill never completes.
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/39896
Comment by Parag Agarwal [ 26/Jul/14 ]
Does not repro in 1033




[MB-11785] mcd aborted in bucket_engine_release_cookie: "es != ((void *)0)" Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 64 vb cluster_run -n1

Attachments: Zip Archive collectinfo-2014-07-22T192534-n_0@127.0.0.1.zip    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Observed while running pyupr unit tests against latest from rel-3.0.0 branch.

After about 20 tests the crash occurred on test_failover_log_n_producers_n_vbuckets. This test passes standalone, so I think it's a matter of running all the tests in succession and then hitting this issue.

backtrace:

Thread 228 (Thread 0x7fed2e7fc700 (LWP 695)):
#0 0x00007fed8b608f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed8b60c388 in __GI_abort () at abort.c:89
#2 0x00007fed8b601e36 in __assert_fail_base (fmt=0x7fed8b753718 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fed8949f28c "es != ((void *)0)",
    file=file@entry=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=line@entry=3301,
    function=function@entry=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:92
#3 0x00007fed8b601ee2 in __GI___assert_fail (assertion=0x7fed8949f28c "es != ((void *)0)",
    file=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=3301,
    function=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:101
#4 0x00007fed8949d13d in bucket_engine_release_cookie (cookie=0x5b422e0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:3301
#5 0x00007fed8835343f in EventuallyPersistentEngine::releaseCookie (this=0x7fed4808f5d0, cookie=0x5b422e0)
    at /couchbase/ep-engine/src/ep_engine.cc:1883
#6 0x00007fed8838d730 in ConnHandler::releaseReference (this=0x7fed7c0544e0, force=false)
    at /couchbase/ep-engine/src/tapconnection.cc:306
#7 0x00007fed883a4de6 in UprConnMap::shutdownAllConnections (this=0x7fed4806e4e0)
    at /couchbase/ep-engine/src/tapconnmap.cc:1004
#8 0x00007fed88353e0a in EventuallyPersistentEngine::destroy (this=0x7fed4808f5d0, force=true)
    at /couchbase/ep-engine/src/ep_engine.cc:2034
#9 0x00007fed8834dc05 in EvpDestroy (handle=0x7fed4808f5d0, force=true) at /couchbase/ep-engine/src/ep_engine.cc:142
#10 0x00007fed89498a54 in engine_shutdown_thread (arg=0x7fed48080540)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1564
#11 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480a5b60) at /couchbase/platform/src/cb_pthreads.c:19
#12 0x00007fed8beba182 in start_thread (arg=0x7fed2e7fc700) at pthread_create.c:312
#13 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 226 (Thread 0x7fed71790700 (LWP 693)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093e80, mutex=0x7fed78093e48, ms=720)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78093e40, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78093e40, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78093e40, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78093e40, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480203e0) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71790700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 225 (Thread 0x7fed71f91700 (LWP 692)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093830, mutex=0x7fed780937f8, ms=86390052)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780937f0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780937f0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780937f0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780937f0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801d490) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71f91700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 224 (Thread 0x7fed72792700 (LWP 691)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3894)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801a670) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed72792700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 223 (Thread 0x7fed70f8f700 (LWP 690)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3893)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed48017850) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed70f8f700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 222 (Thread 0x7fed7078e700 (LWP 689)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1672)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b8e90) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed7078e700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 221 (Thread 0x7fed0effd700 (LWP 688)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1673)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b6890) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed0effd700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 210 (Thread 0x7fed0f7fe700 (LWP 661)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed740e8910)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed740667e0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 201 (Thread 0x7fed0ffff700 (LWP 644)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed74135070)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed74050ef0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 192 (Thread 0x7fed2cff9700 (LWP 627)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7c90)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c078340) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2cff9700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 183 (Thread 0x7fed2d7fa700 (LWP 610)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009e000)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5009dfe0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2d7fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 174 (Thread 0x7fed2dffb700 (LWP 593)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009dc30)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed50031010) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2dffb700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 165 (Thread 0x7fed2f7fe700 (LWP 576)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed481cef20)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480921c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 147 (Thread 0x7fed2effd700 (LWP 541)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed540015d0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54057b80) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2effd700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 138 (Thread 0x7fed6df89700 (LWP 523)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed78092aa0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78056ea0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6df89700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 120 (Thread 0x7fed2ffff700 (LWP 489)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7d10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c1b7ac0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 111 (Thread 0x7fed6cf87700 (LWP 472)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5008c030)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500adf50) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6cf87700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 102 (Thread 0x7fed6d788700 (LWP 455)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080450)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54091560) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6d788700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 93 (Thread 0x7fed6ff8d700 (LWP 438)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080ad0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54068db0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ff8d700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 57 (Thread 0x7fed6e78a700 (LWP 370)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50080230)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5008c360) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6e78a700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 48 (Thread 0x7fed6ef8b700 (LWP 352)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50000c10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500815b0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ef8b700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 39 (Thread 0x7fed6f78c700 (LWP 334)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed4807c290)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4806e4c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6f78c700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 13 (Thread 0x7fed817fa700 (LWP 292)):
#0 0x00007fed8b693d7d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b6c5334 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2 0x00007fed88386dd2 in updateStatsThread (arg=0x7fed780343f0) at /couchbase/ep-engine/src/memory_tracker.cc:36
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78034450) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed817fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 10 (Thread 0x7fed8aec4700 (LWP 116)):
#0 0x00007fed8b6be6bd in read () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b64d4e0 in _IO_new_file_underflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at fileops.c:613
#2 0x00007fed8b64e46e in __GI__IO_default_uflow (fp=0x7fed8b992640 <_IO_2_1_stdin_>) at genops.c:435
#3 0x00007fed8b642184 in __GI__IO_getline_info (fp=0x7fed8b992640 <_IO_2_1_stdin_>, buf=0x7fed8aec3e40 "", n=79, delim=10,
    extract_delim=1, eof=0x0) at iogetline.c:69
#4 0x00007fed8b641106 in _IO_fgets (buf=0x7fed8aec3e40 "", n=0, fp=0x7fed8b992640 <_IO_2_1_stdin_>) at iofgets.c:56
#5 0x00007fed8aec5b24 in check_stdin_thread (arg=0x41c0ee <shutdown_server>)
    at /couchbase/memcached/extensions/daemon/stdin_check.c:38
#6 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a66250) at /couchbase/platform/src/cb_pthreads.c:19
#7 0x00007fed8beba182 in start_thread (arg=0x7fed8aec4700) at pthread_create.c:312
#8 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 9 (Thread 0x7fed89ea3700 (LWP 117)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed8a6c3280 <cond>, mutex=0x7fed8a6c3240 <mutex>, ms=19000)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8a4c0fea in logger_thead_main (arg=0x1a66fe0) at /couchbase/memcached/extensions/loggers/file_logger.c:372
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x1a67050) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed89ea3700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fed89494700 (LWP 135)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9cb0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd0f0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed89494700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7fed88c93700 (LWP 136)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9da0) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd240) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed88c93700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fed83fff700 (LWP 137)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9e90) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd390) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed83fff700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fed837fe700 (LWP 138)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bc9f80) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd4e0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed837fe700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fed82ffd700 (LWP 139)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca070) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd630) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed82ffd700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fed827fc700 (LWP 140)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca160) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd780) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed827fc700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fed81ffb700 (LWP 141)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041df04 in worker_libevent (arg=0x5bca250) at /couchbase/memcached/daemon/thread.c:277
#4 0x00007fed8cf43963 in platform_thread_wrap (arg=0x5bcd8d0) at /couchbase/platform/src/cb_pthreads.c:19
#5 0x00007fed8beba182 in start_thread (arg=0x7fed81ffb700) at pthread_create.c:312
#6 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fed8d764780 (LWP 113)):
#0 0x00007fed8b6cd9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8c725ef3 in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#2 0x00007fed8c712295 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5
#3 0x000000000041d24e in main (argc=3, argv=0x7fff77aaa838) at /couchbase/memcached/daemon/memcached.c:8797

 Comments   
Comment by Chiyoung Seo [ 23/Jul/14 ]
Abhinav,

The backtrace indicates that the abort was caused by closing all the UPR connections during shutdown, an area where we made some fixes recently.
Comment by Abhinav Dangeti [ 24/Jul/14 ]
Tommie, can you tell me how to run these tests, so I can try to reproduce this on my system?
Comment by Tommie McAfee [ 24/Jul/14 ]
Start a cluster_run node, then:

git clone https://github.com/couchbaselabs/pyupr.git
cd pyupr
./pyupr -h 127.0.0.1:9000 -b dev


Note that all the tests may pass even though memcached silently aborts in the background.
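
A minimal sketch of one way to catch the silent abort, assuming a Linux host where the cluster_run memcached shows up in the process table simply as "memcached"; the process name and the pgrep approach are assumptions, not something pyupr provides:

import subprocess
import sys

def memcached_pids():
    # pgrep exits non-zero when nothing matches, which we treat as "no memcached".
    try:
        out = subprocess.check_output(["pgrep", "-x", "memcached"])
    except subprocess.CalledProcessError:
        return []
    return [int(p) for p in out.decode().split()]

if __name__ == "__main__":
    before = set(memcached_pids())
    # Run the suite as in the instructions above.
    subprocess.call(["./pyupr", "-h", "127.0.0.1:9000", "-b", "dev"])
    after = set(memcached_pids())
    gone = before - after
    if gone:
        sys.exit("memcached aborted during the run; pid(s) %s disappeared" % sorted(gone))
    print("memcached still running: pid(s) %s" % sorted(after))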
Comment by Abhinav Dangeti [ 24/Jul/14 ]
1. Server side: if a UPR producer or UPR consumer already exists for that cookie, the engine should return DISCONNECT: http://review.couchbase.org/#/c/39843
2. pyupr: in the test test_failover_log_n_producers_n_vbuckets, you are essentially opening one connection and sending 1024 open-connection messages, so many tests will need changes; a sketch of the corrected pattern follows below.
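
For illustration only, a hedged sketch of that corrected pattern: one TCP connection per producer, each sending exactly one open message. The opcode value (0x50), the extras layout (4-byte seqno plus 4-byte flags), the producer flag (0x01) and port 12000 for cluster_run node 0 reflect my reading of the 3.0-era UPR protocol and should be treated as assumptions; bucket selection and SASL auth are omitted.

import socket
import struct

UPR_OPEN = 0x50          # assumed opcode for the UPR/DCP "open connection" request
FLAG_PRODUCER = 0x01     # assumed "producer" flag in the open extras

def recv_exact(sock, nbytes):
    data = b""
    while len(data) < nbytes:
        chunk = sock.recv(nbytes - len(data))
        if not chunk:
            raise IOError("connection closed before full response header")
        data += chunk
    return data

def upr_open_producer(host, port, name):
    """Open a single producer on its own TCP connection and return the socket."""
    sock = socket.create_connection((host, port))
    extras = struct.pack(">II", 0, FLAG_PRODUCER)          # seqno, flags
    key = name.encode("utf-8")                             # connection name
    header = struct.pack(">BBHBBHIIQ",
                         0x80, UPR_OPEN,                   # magic, opcode
                         len(key), len(extras), 0, 0,      # keylen, extlen, datatype, vbucket
                         len(key) + len(extras), 0, 0)     # total body, opaque, cas
    sock.sendall(header + extras + key)
    resp = recv_exact(sock, 24)
    status = struct.unpack(">BBHBBHIIQ", resp)[5]          # status sits in the vbucket slot of responses
    if status != 0:
        raise RuntimeError("open failed with status 0x%x" % status)
    return sock

if __name__ == "__main__":
    # One connection per producer: N producers means N sockets, each sending
    # exactly one open, instead of 1024 opens on a single connection.
    producers = [upr_open_producer("127.0.0.1", 12000, "producer_%d" % i)
                 for i in range(4)]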
Comment by Chiyoung Seo [ 24/Jul/14 ]
Tommie,

The server side fix was merged.

Can you please fix the issue in the test script and retest it?
Comment by Tommie McAfee [ 25/Jul/14 ]
Thanks, working now; the affected tests pass with the patch:

http://review.couchbase.org/#/c/39878/1




[MB-11784] GUI incorrectly displays vBucket number in stats Created: 22/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Ian McCloy Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: customer, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 251VbucketDisplay.png     PNG File 3fixVbucketDisplay.png    
Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Many customers are confused and have complained that on the "General Bucket Analytics" / "VBUCKET RESOURCES" page, when listing the number of vBuckets, the GUI tries to convert the value of 1024 default vBuckets to kilobytes, so it displays as 1.02k vBuckets (screenshot attached). vBucket counts shouldn't be abbreviated and should always show the full number.

I've changed the JavaScript to detect vBucket values and not abbreviate them (screenshot attached). I will amend this with a gerrit link when the change is pushed to review.
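
For illustration, a small sketch (in Python rather than the UI's JavaScript) of the distinction being asked for: stats that are counts of discrete things, such as vBuckets, are rendered in full, while other stats keep the existing prefix formatting. The COUNT_STATS names are placeholders, not the real stat identifiers.

# Placeholder names; the real UI keys differ.
COUNT_STATS = {"vb_active_num", "vb_replica_num", "vb_pending_num"}

def format_stat(name, value):
    if name in COUNT_STATS:
        return str(int(value))                      # 1024 stays "1024"
    # Existing behaviour for everything else: metric prefix, three significant digits.
    for prefix, factor in (("G", 1e9), ("M", 1e6), ("K", 1e3)):
        if value >= factor:
            return "%.2f%s" % (value / factor, prefix)
    return str(value)

assert format_stat("vb_active_num", 1024) == "1024"
assert format_stat("mem_used", 1048576) == "1.05M"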

 Comments   
Comment by Ian McCloy [ 22/Jul/14 ]
Code added to gerrit for review -> http://review.couchbase.org/#/c/39668/
Comment by Pavel Blagodov [ 24/Jul/14 ]
Hi Ian, here is a clarification:
- kilo (or 'K') is a unit prefix in the metric system denoting multiplication by one thousand.
- kilobyte (or 'KB') is a multiple of the unit byte for digital information.
Comment by Ian McCloy [ 24/Jul/14 ]
Pavel, thank you for clearing that up for me. Can you please explain: when I see 1.02K vBuckets in the stats, is that 1022, 1023 or 1024 active vBuckets? I'm not clear when I look at the UI.
Comment by Pavel Blagodov [ 25/Jul/14 ]
1.02K is the expected value because the UI currently truncates all analytics stats to three significant digits. Of course we could increase this to four digits, but that would only work for K (not for M, for example).
Comment by David Haikney [ 25/Jul/14 ]
@Pavel - Yes, 1.02K is currently expected, but the desire here is to change the UI to show "1024" instead of "1.02K": fewer characters and more accuracy.




[MB-11783] We need the administrator creds available in isasl.pw Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Major
Reporter: Trond Norbye Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
We'd like to add authentication for some of the operations (like setting configuration tunables dynamically). Instead of telling the user to go look on the system for isasl.pw, dig out the _admin entry and then use it with the generated password, it would be nice if the credentials defined when setting up the cluster could be used.

 Comments   
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Interesting. Very much :)

I cannot do it because we don't have administrator creds anymore. We just have some kind of password hash and that's it.

I admit my fault; I could have been more forward-looking. But it was somewhat guided by your response back in the day, which I interpreted as reluctance to allow memcached auth via admin credentials.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And of course all admin ops to memcached can still be handled by ns_server, safely and in a more controlled way (globally or locally, as needed).




[MB-11782] Adding Nodes To A Cluster Can Result In Reduced Active Residency Percentages Created: 22/Jul/14  Updated: 22/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Minor
Reporter: Morrie Schreibman Assignee: Ravi Mayuram
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
A customer added 6 nodes to a large cluster in an attempt to increase the overall percentage of active bucket data in cache, and observed that the active bucket residency decreased after rebalancing. This decrease in active data residency after adding nodes and rebalancing turns out to be reproducible.

To reproduce this anomaly, create an 8-node cluster with a RAM quota of 100 MB per node and populate the default bucket until the active percentage in memory is about 40%. (I used cbworkloadgen and inserted 300K items into the default bucket, specifying an item size of 2K bytes and enabling the -j (JSON) option.) Add 3 nodes to this cluster and rebalance. The active memory residency percentage of the default bucket will drop significantly and the replica residency percentage will increase. Note that if 3 random nodes are then removed and rebalanced, and then added back and rebalanced again, active residency will increase beyond the initial level.

The critical factor in reproducing this anomaly is that the bucket data size must exceed its RAM quota, such that the majority of bucket data resides on disk at any given time. When nodes are added to the cluster, the subsequent rebalance results in entire vbuckets being read from disk on one node and dumped into cache on the receiving node via the TAP protocol. Eventually, the receiving node's high-water mark is exceeded and ejections occur. What is consistently observable is that active items are ejected at a greater rate than replica items, which results in a decreased active bucket residency percentage and an increased replica bucket residency percentage.

Possible workarounds include adding/rebalancing nodes in stages, e.g., instead of adding 6 nodes to a cluster at once, add 3 nodes, rebalance, then add 3 more nodes and rebalance again. A second potential workaround would be to alter the default ejection probabilities for replica and active data, reducing the probability of ejecting active data and increasing the probability of ejecting replica data. I have not had time to test these possible workarounds.

After discussion in the Support group, our thinking is that any configuration change made with the intention of improving performance should not result in worsened performance, but that is what can happen in this case. Accordingly, we believe that this is a bug and that the rebalancing algorithm should be examined to figure out why, under certain circumstances, rebalancing makes active data more likely to be ejected.
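
For anyone reproducing this, a hedged sketch of how the active/replica residency could be watched on one node while the rebalance runs. It assumes cbstats is on the PATH and that the ep-engine stats vb_active_perc_mem_resident and vb_replica_perc_mem_resident appear in the output of "cbstats <host>:11210 all"; verify the stat names and flags against your build.

import subprocess
import time

def resident_ratios(host="127.0.0.1", port=11210, bucket="default"):
    # Dump all ep-engine stats for the bucket and pick out the two residency figures.
    out = subprocess.check_output(
        ["cbstats", "%s:%d" % (host, port), "all", "-b", bucket]).decode()
    stats = {}
    for line in out.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            stats[key.strip()] = val.strip()
    return (stats.get("vb_active_perc_mem_resident", "?"),
            stats.get("vb_replica_perc_mem_resident", "?"))

if __name__ == "__main__":
    while True:
        active, replica = resident_ratios()
        print("active resident %s%%, replica resident %s%%" % (active, replica))
        time.sleep(10)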

 Comments   
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
Great job finding this out.

But I cannot just go ahead and improve it. And keep in mind that UPR might change things a lot, in both better and worse directions.

Eviction is something that I don't have any control over or much understanding of. I believe you'll need to ask Chiyoung's team to provide some instructions on what to do.

I can only add one guess. If it's related to multiple vbuckets being moved at the same time (which might be the case, but it's hard to say how much it contributes), then you will be able to check that by lowering the rebalanceMovesBeforeCompaction internal setting.
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
With that said, passing this to a higher level of the engineering suborganization.




Generated at Tue Jul 29 02:30:03 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.