[MB-11440] {XDCR SSL UPR}: Possible regression in replication rate compared to 2.5.1 Created: 17/Jun/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Done Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-814
XDCR -> UPR

Attachments: Zip Archive 10.3.4.186-6232014-1614-diag.zip     Zip Archive 10.3.4.187-6232014-1615-diag.zip     Zip Archive 10.3.4.188-6232014-1616-diag.zip     Zip Archive 10.3.4.189-6232014-1618-diag.zip     File revIDs.rtf     Text File revID_xmem.txt     Text File xmem2_revIDs.txt    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: [Source]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/049a07cd/10.1.3.93-diag.txt.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/5d80dcb2/10.1.3.93-6162014-1236-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/c864813f/10.1.3.93-6162014-1228-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/6766628f/10.1.3.94-6162014-1237-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8a3cd5c2/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/c5818561/10.1.3.94-6162014-1228-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2e4bb369/10.1.3.95-6162014-1238-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/589f740c/10.1.3.95-diag.txt.gz

10.1.3.95 was failed over during the test.


[Destination]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/66037d6b/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/74f90c6a/10.1.3.96-6162014-1228-couch.tar.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/89d810de/10.1.3.96-6162014-1239-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8260baf2/10.1.3.97-6162014-1240-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8d1da3e3/10.1.3.97-6162014-1229-couch.tar.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/b59b07fc/10.1.3.97-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/1bf11bfc/10.1.3.99-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2e4492f9/10.1.3.99-6162014-1242-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/b9febae4/10.1.3.99-6162014-1229-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/4beb391b/10.1.2.12-6162014-1229-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/630b4a6b/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/7642751c/10.1.2.12-6162014-1241-diag.zip
Is this a Regression?: Yes

 Description   
http://qa.hq.northscale.net/job/centos_x64--01_02--XDCR_SSL-P0/10/consoleFull

[Test]
./testrunner -i centos_x64--01_01--uniXDCR_biXDCR-P0.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,demand_encryption=1 -t xdcr.biXDCR.bidirectional.load_with_failover,replicas=1,items=10000,ctopology=chain,rdirection=bidirection,sasl_buckets=2,default_bucket=False,doc-ops=create-update-delete,doc-ops-dest=create-update,failover=source,timeout=180,GROUP=P1


[Test Error]
[2014-06-16 12:25:05,339] - [task:443] INFO - Saw ep_queue_size 0 == 0 expected on '10.1.3.99:8091',sasl_bucket_2 bucket
[2014-06-16 12:25:05,383] - [xdcrbasetests:1335] INFO - Waiting for Outbound mutation to be zero on cluster node: 10.1.3.96
[2014-06-16 12:25:05,555] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 461
[2014-06-16 12:25:05,702] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 519
[2014-06-16 12:25:05,703] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:15,862] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 454
[2014-06-16 12:25:16,003] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 485
[2014-06-16 12:25:16,004] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:26,165] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 433
[2014-06-16 12:25:26,260] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 485
[2014-06-16 12:25:26,260] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:36,408] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 463
[2014-06-16 12:25:36,561] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 491
[2014-06-16 12:25:36,562] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:46,739] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 422
[2014-06-16 12:25:46,884] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 454
[2014-06-16 12:25:46,884] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:25:57,032] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 415
[2014-06-16 12:25:57,166] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 475
[2014-06-16 12:25:57,167] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:07,316] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 396
[2014-06-16 12:26:07,467] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 458
[2014-06-16 12:26:07,468] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:17,627] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 388
[2014-06-16 12:26:17,765] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 442
[2014-06-16 12:26:17,765] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:27,914] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 354
[2014-06-16 12:26:28,050] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 431
[2014-06-16 12:26:28,050] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:38,198] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 343
[2014-06-16 12:26:38,343] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 432
[2014-06-16 12:26:38,344] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:48,496] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 362
[2014-06-16 12:26:48,643] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 415
[2014-06-16 12:26:48,644] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:26:58,798] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 320
[2014-06-16 12:26:58,942] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 373
[2014-06-16 12:26:58,942] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:09,110] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 308
[2014-06-16 12:27:09,257] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 354
[2014-06-16 12:27:09,257] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:19,414] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 269
[2014-06-16 12:27:19,568] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 352
[2014-06-16 12:27:19,568] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:29,659] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 255
[2014-06-16 12:27:29,767] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 271
[2014-06-16 12:27:29,768] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:39,936] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 262
[2014-06-16 12:27:40,079] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 271
[2014-06-16 12:27:40,079] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:27:50,239] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 226
[2014-06-16 12:27:50,386] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 274
[2014-06-16 12:27:50,387] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:28:00,555] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_1 is 235
[2014-06-16 12:28:00,740] - [xdcrbasetests:1344] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket sasl_bucket_2 is 255
[2014-06-16 12:28:00,740] - [xdcrbasetests:366] INFO - sleep for 10 secs. ...
[2014-06-16 12:28:10,744] - [xdcrbasetests:1354] ERROR - Timeout occurs while waiting for mutations to be replicated
..
..
..
[2014-06-16 12:28:15,721] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo451 =====
[2014-06-16 12:28:15,722] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:1
[2014-06-16 12:28:15,723] - [task:1203] ERROR - cas mismatch: Source cas:16542222942424, Destination cas:16940491943424, Error Count:2
[2014-06-16 12:28:15,724] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542222942424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,725] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16940491943424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,827] - [data_helper:289] INFO - creating direct client 10.1.3.99:11210 sasl_bucket_2
[2014-06-16 12:28:15,828] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2790 =====
[2014-06-16 12:28:15,828] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:3
[2014-06-16 12:28:15,829] - [task:1203] ERROR - cas mismatch: Source cas:16543128831424, Destination cas:16954143372424, Error Count:4
[2014-06-16 12:28:15,829] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543128831424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,830] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16954143372424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,850] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo523 =====
[2014-06-16 12:28:15,851] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:5
[2014-06-16 12:28:15,856] - [task:1203] ERROR - cas mismatch: Source cas:16542229414424, Destination cas:16940869390424, Error Count:6
[2014-06-16 12:28:15,856] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542229414424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,856] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16940869390424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,871] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo1286 =====
[2014-06-16 12:28:15,874] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:7
[2014-06-16 12:28:15,875] - [task:1203] ERROR - cas mismatch: Source cas:16542614464424, Destination cas:16945939925424, Error Count:8
[2014-06-16 12:28:15,875] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542614464424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:15,876] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16945939925424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,045] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo131 =====
[2014-06-16 12:28:16,045] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:9
[2014-06-16 12:28:16,046] - [task:1203] ERROR - cas mismatch: Source cas:16542224457424, Destination cas:16938972892424, Error Count:10
[2014-06-16 12:28:16,047] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542224457424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,047] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16938972892424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,125] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo933 =====
[2014-06-16 12:28:16,126] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:11
[2014-06-16 12:28:16,126] - [task:1203] ERROR - cas mismatch: Source cas:16542224736424, Destination cas:16943647131424, Error Count:12
[2014-06-16 12:28:16,127] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542224736424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,127] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16943647131424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,132] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2680 =====
[2014-06-16 12:28:16,133] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:13
[2014-06-16 12:28:16,133] - [task:1203] ERROR - cas mismatch: Source cas:16543137791424, Destination cas:16953636997424, Error Count:14
[2014-06-16 12:28:16,134] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543137791424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,135] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16953636997424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,174] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2096 =====
[2014-06-16 12:28:16,175] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:15
[2014-06-16 12:28:16,175] - [task:1203] ERROR - cas mismatch: Source cas:16543128601424, Destination cas:16950500665424, Error Count:16
[2014-06-16 12:28:16,176] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543128601424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,177] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16950500665424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,228] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2022 =====
[2014-06-16 12:28:16,228] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:17
[2014-06-16 12:28:16,229] - [task:1203] ERROR - cas mismatch: Source cas:16543131209424, Destination cas:16949985515424, Error Count:18
[2014-06-16 12:28:16,230] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543131209424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,230] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16949985515424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,240] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2146 =====
[2014-06-16 12:28:16,240] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:19
[2014-06-16 12:28:16,241] - [task:1203] ERROR - cas mismatch: Source cas:16543131719424, Destination cas:16950745984424, Error Count:20
[2014-06-16 12:28:16,241] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543131719424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,242] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16950745984424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,329] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo479 =====
[2014-06-16 12:28:16,330] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:21
[2014-06-16 12:28:16,331] - [task:1203] ERROR - cas mismatch: Source cas:16542264845424, Destination cas:16940633628424, Error Count:22
[2014-06-16 12:28:16,332] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542264845424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,333] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16940633628424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,343] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo700 =====
[2014-06-16 12:28:16,344] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:23
[2014-06-16 12:28:16,344] - [task:1203] ERROR - cas mismatch: Source cas:16542254202424, Destination cas:16941821299424, Error Count:24
[2014-06-16 12:28:16,345] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542254202424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,346] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16941821299424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,352] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo1316 =====
[2014-06-16 12:28:16,352] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:25
[2014-06-16 12:28:16,353] - [task:1203] ERROR - cas mismatch: Source cas:16542621575424, Destination cas:16946095632424, Error Count:26
[2014-06-16 12:28:16,353] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542621575424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,354] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16946095632424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,381] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo603 =====
[2014-06-16 12:28:16,382] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:27
[2014-06-16 12:28:16,382] - [task:1203] ERROR - cas mismatch: Source cas:16542268666424, Destination cas:16941311155424, Error Count:28
[2014-06-16 12:28:16,383] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16542268666424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,383] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16941311155424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,501] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2957 =====
[2014-06-16 12:28:16,502] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:29
[2014-06-16 12:28:16,503] - [task:1203] ERROR - cas mismatch: Source cas:16543130010424, Destination cas:16954877533424, Error Count:30
[2014-06-16 12:28:16,504] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543130010424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,505] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16954877533424, 'flags': 0, 'expiration': 0}


[Test Steps]
1. Set up a 3-node source cluster and a 4-node destination cluster.
2. Create buckets sasl_bucket_1 and sasl_bucket_2 on each side.
3. Set up CAPI-mode bi-directional XDCR for each bucket.
4. Load 10000 items into each bucket on each side.
5. Fail over and rebalance out one source node.
6. Perform 30% updates and 30% deletes on each bucket on the source cluster.
7. Perform 30% updates on each bucket on the destination cluster.
8. Verify items.
    a) Some items were left in the replication queue.
    b) As a result, not all mutations were replicated.


Created a separate bug for 3.0 XDCR over UPR, distinct from MB-9707.
Outbound mutations never reached 0, so not all mutations were replicated, which was eventually confirmed by the metadata mismatch between the clusters.
After MB-9707, the test cases were modified to proceed even if replication_changes_left has not reached 0 within 5 minutes.





 Comments   
Comment by Aruna Piravi [ 17/Jun/14 ]
Hi Sangharsh, can you also please indicate the number of items you find on source and destination so we get an idea of what % is not replicated? Thanks.
Comment by Sangharsh Agarwal [ 17/Jun/14 ]
Aruna,

Bi-directional XDCR is set up here as A <------> B, where A is the source cluster and B is the destination cluster. The initial 10000 items were replicated successfully in both directions. Mutations (the 30% updates mentioned in step 7) from B -> A were not replicated completely.

Nodes in Cluster A -> 10.1.3.93 (Master), 10.1.3.94, 10.1.3.95 (Failover node)
Nodes in Cluster B -> 10.1.3.96, 10.1.3.97, 10.1.3.99, 10.1.2.12


In the logs below, read "Source meta data" as Cluster A and "Dest meta data" as Cluster B.

[2014-06-16 12:28:16,501] - [task:1202] ERROR - ===== Verifying rev_ids failed for key: loadTwo2957 =====
[2014-06-16 12:28:16,502] - [task:1203] ERROR - seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:29
[2014-06-16 12:28:16,503] - [task:1203] ERROR - cas mismatch: Source cas:16543130010424, Destination cas:16954877533424, Error Count:30
[2014-06-16 12:28:16,504] - [task:1204] ERROR - Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 16543130010424, 'flags': 0, 'expiration': 0}
[2014-06-16 12:28:16,505] - [task:1205] ERROR - Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 16954877533424, 'flags': 0, 'expiration': 0}


So after step 7, the seqno of key loadTwo2957 becomes 2 on Cluster B but remains 1 on Cluster A.

Note: Item counts are not printed in the test logs, but they can be determined from the data files.
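For reference, the rev_id verification the test performs is essentially a per-key comparison of the metadata fields shown above (seqno, cas, flags, expiration) between the two clusters. A minimal Python sketch of that check, assuming a hypothetical get_meta(node, bucket, key) helper (illustrative only, not the actual testrunner API):

def check_key_rev_id(key, get_meta, source_node, dest_node, bucket):
    # get_meta is a hypothetical callable (node, bucket, key) -> dict like
    # {'deleted': 0, 'seqno': 1, 'cas': ..., 'flags': 0, 'expiration': 0},
    # i.e. the same fields printed in the log lines above.
    src = get_meta(source_node, bucket, key)
    dst = get_meta(dest_node, bucket, key)
    mismatches = []
    if src['seqno'] != dst['seqno']:
        mismatches.append('seqno mismatch: Source seqno:%s, Destination seqno:%s'
                          % (src['seqno'], dst['seqno']))
    if src['cas'] != dst['cas']:
        mismatches.append('cas mismatch: Source cas:%s, Destination cas:%s'
                          % (src['cas'], dst['cas']))
    if mismatches:
        print('===== Verifying rev_ids failed for key: %s =====' % key)
        for line in mismatches:
            print(line)
    return not mismatches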



Comment by Aruna Piravi [ 17/Jun/14 ]
We do print active and replica item counts in the logs -

[2014-06-16 12:28:11,546] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:11,641] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:11,721] - [task:443] INFO - Saw curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_1 bucket
[2014-06-16 12:28:11,734] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:11,821] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:11,895] - [task:443] INFO - Saw vb_active_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_1 bucket
[2014-06-16 12:28:11,909] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:11,997] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:12,079] - [task:443] INFO - Saw vb_replica_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_1 bucket
[2014-06-16 12:28:12,096] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:12,184] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:12,289] - [task:443] INFO - Saw curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_2 bucket
[2014-06-16 12:28:12,304] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:12,402] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:12,487] - [task:443] INFO - Saw vb_active_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_2 bucket
[2014-06-16 12:28:12,501] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:12,592] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:12,672] - [task:443] INFO - Saw vb_replica_curr_items 17000 == 17000 expected on '10.1.3.93:8091''10.1.3.94:8091',sasl_bucket_2 bucket
[2014-06-16 12:28:12,773] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_1
[2014-06-16 12:28:12,865] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_1
[2014-06-16 12:28:12,974] - [task:1042] INFO - 20000 items will be verified on sasl_bucket_1 bucket
[2014-06-16 12:28:13,041] - [data_helper:289] INFO - creating direct client 10.1.3.93:11210 sasl_bucket_2
[2014-06-16 12:28:13,149] - [data_helper:289] INFO - creating direct client 10.1.3.94:11210 sasl_bucket_2
[2014-06-16 12:28:13,248] - [task:1042] INFO - 20000 items will be verified on sasl_bucket_2 bucket

So this is a case where the source and destination item counts are equal, but some updates have not been propagated to the other side, as can be seen from the metadata.
Comment by Aleksey Kondratenko [ 17/Jun/14 ]
Have you tried the same thing but without SSL? Have you tried something simpler?
Comment by Sangharsh Agarwal [ 17/Jun/14 ]
Alk, the test passes without SSL. It is a regular test.
Comment by Aruna Piravi [ 19/Jun/14 ]
Sangharsh is right; I wasn't able to reproduce the problem with non-encrypted XDCR. However, I am seeing it with encrypted XDCR.

Attaching logs with master trace enabled.
Comment by Aruna Piravi [ 19/Jun/14 ]
.186, .187 = C1
.188, .189 = C2
Comment by Aleksey Kondratenko [ 20/Jun/14 ]
Looking just at /diag I see multiple memcached crashes. Do I understand correctly that you already filed Blocker bugs for every occurrence? If not then please make it your policy because memcached crashes in production are not acceptable.
Comment by Aleksey Kondratenko [ 20/Jun/14 ]
Another thing regarding the latest set of logs: especially if something is stuck (rebalance, replication, views, or anything else), I need collectinfos captured _during_ the test, not after cleanup. In this case they appear to have been captured after cleanup.
Comment by Aleksey Kondratenko [ 20/Jun/14 ]
See above. I need logs captured during replication that's not catching up.
Comment by Aruna Piravi [ 23/Jun/14 ]
Attaching logs captured when updates are not replicated.
Comment by Aruna Piravi [ 23/Jun/14 ]
Will check the memc crashes.
Comment by Aruna Piravi [ 23/Jun/14 ]
Seen with both capi and xmem protocols.
Comment by Aruna Piravi [ 23/Jun/14 ]
Hopefully this set of logs contains all the info you need. Thanks for your patience.
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
I don't have trace logs in these collectinfos.
Comment by Aruna Piravi [ 23/Jun/14 ]
They are present as 'ns_server.xdcr_trace.log' in all zip files.
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
May I have the same test but with xmem instead of capi? Also, may I have the data files too, just in case?
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
I see some evidence in the xdcr_trace logs that recent versions of the missing items _are_ being sent to the other side. The same evidence with xmem will exclude the capi layer, and then I'll be able to pass this to the ep-engine guys.
Comment by Aruna Piravi [ 23/Jun/14 ]
xmem cbcollect and data files - https://s3.amazonaws.com/bugdb/jira/MB-11440/xmem.tar
Comment by Aruna Piravi [ 23/Jun/14 ]
Attaching revIDs_xmem.txt = keys that are missing updates
Comment by Aleksey Kondratenko [ 23/Jun/14 ]
Please do another xmem run with a build that has http://review.couchbase.org/38728 (which I merged a few moments ago).
Comment by Sangharsh Agarwal [ 24/Jun/14 ]
Alk, Please find the logs after your merge on build 3.0.0-865:

[Test Logs]
https://friendpaste.com/5bHM8T5WTALfLptXwOyMAd

[Source]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2a3ac6d9/10.1.3.93-6242014-238-couch.tar.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8105eeb6/10.1.3.93-6242014-242-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/daf50fef/10.1.3.93-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/4031c179/10.1.3.94-6242014-238-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/797b1631/10.1.3.94-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/e5b654de/10.1.3.94-6242014-243-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/3be890cf/10.1.3.95-diag.txt.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/9ce3aedb/10.1.3.95-6242014-244-diag.zip

[Destination]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/48bb6a6f/10.1.3.96-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/6382ff46/10.1.3.96-6242014-245-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f1fec989/10.1.3.96-6242014-239-couch.tar.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/10ca436d/10.1.3.97-6242014-244-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/21438682/10.1.3.97-6242014-239-couch.tar.gz
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-11440/a5e2d504/10.1.3.97-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/01235b22/10.1.3.99-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/57abd92c/10.1.3.99-6242014-246-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f79db0e4/10.1.3.99-6242014-239-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/2eba987e/10.1.2.12-6242014-247-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/b9dbd837/10.1.2.12-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f9f8a5e4/10.1.2.12-6242014-239-couch.tar.gz
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Sangharsh, please be aware that next time I'll bounce the ticket back to you if you don't start capturing collectinfos _during_ XDCR and _not after_ you've cleaned everything up.
Comment by Aruna Piravi [ 24/Jun/14 ]
Thanks Sangharsh. Alk pls let us know if you need anything else. Thanks.
Comment by Aruna Piravi [ 24/Jun/14 ]
Ok, will get you new set of logs.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
It is possible that we're dealing with two distinct bugs here. In particular, in the logs from Sangharsh I'm not seeing the second revision of loadTwo1906 even being considered for replication, which could happen, for example, if some UPR bit is broken. But in yesterday's logs from Aruna I saw the expected revisions replicated.

So:

1) Aruna, please rerun in your environment and give me new logs.

2) Sangharsh, please rerun and give me logs that are captured _before_ you clean up everything.
Comment by Aruna Piravi [ 24/Jun/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11440/xmem2.tar
Comment by Aruna Piravi [ 24/Jun/14 ]
xmem2_revIDs.txt attached
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Seeing evidence that we do send those newer revisions but they are refused by ep-engine's conflict resolution:

I.e:

2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] ===== Verifying rev_ids failed for key: loadTwo1614 =====
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] seqno mismatch: Source seqno:1, Destination seqno:2, Error Count:97
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] cas mismatch: Source cas:7329801498721320, Destination cas:7329957630538902, Error Count:98
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] Source meta data: {'deleted': 0, 'seqno': 1, 'cas': 7329801498721320, 'flags': 0, 'expiration': 0}
2014-06-24 11:59:32 | ERROR | MainProcess | load_gen_task | [task._check_key_revId] Dest meta data: {'deleted': 0, 'seqno': 2, 'cas': 7329957630538902, 'flags': 0, 'expiration': 0}


and corresponding traces:

./cbcollect_info_ns_1@10.3.4.186_20140624-190117/ns_server.xdcr_trace.log:399748:{"pid":"<0.32475.4>","type":"missing","ts":1403636156.392065,"startTS":1403636156.392047,"k":[["loadTwo2347",19,2],["loadTwo1784",20,2],["loadTwo1614",21,2],["loadTwo1166",22,2],["loadTwo89",23,2]],"loc":"xdc_vbucket_rep_worker:find_missing:238"}
./cbcollect_info_ns_1@10.3.4.186_20140624-190117/ns_server.xdcr_trace.log:399766:{"pid":"<0.32475.4>","type":"xmemSetMetas","ts":1403636156.39351,"ids":["loadTwo89","loadTwo1166","loadTwo1614","loadTwo1784","loadTwo2347"],"statuses":["key_eexists","key_eexists","key_eexists","key_eexists","key_eexists"],"startTS":1403636156.392241,"loc":"xdc_vbucket_rep_xmem:flush_docs:60"}

I'll wait for Sangharsh's logs to see whether he's facing the same problem or not, and then I'll pass this to the ep-engine team.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
Actually, we can see that earlier replication sends this document just fine.

Also, couch_dbdump of the corresponding vbuckets reveals that both sides _have_ revision 2.

Please double check your testing code.
Comment by Aleksey Kondratenko [ 24/Jun/14 ]
See above
Comment by Aruna Piravi [ 24/Jun/14 ]
Indeed, I see the same from views. This is a test code issue. I will investigate. Thanks for your time.

[root@centos-64-x64 tmp]# diff <(curl http://Administrator:password@10.3.4.186:8092/sasl_bucket_1/_design/dev_doc/_view/sasl1?full_set=true&inclusive_end=false&stale=false&connection_timeout=60000&limit=1000000&skip=0) <(curl http://Administrator:password@10.3.4.188:8092/sasl_bucket_1/_design/dev_doc/_view/sasl1?stale=false&inclusive_end=false&connection_timeout=60000&limit=100000&skip=0 )
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
  0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 1456k 0 1456k 0 0 2169k 0 --:--:-- --:--:-- --:--:-- 2170k
100 1456k 0 1456k 0 0 1070k 0 --:--:-- 0:00:01 --:--:-- 1072k
[root@centos-64-x64 tmp]#
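A rough Python equivalent of the curl/diff check above, assuming the `requests` library and reusing the view URL and credentials from the command (illustrative only, not the test code):

import requests

VIEW_PARAMS = {'full_set': 'true', 'inclusive_end': 'false', 'stale': 'false',
               'connection_timeout': 60000, 'limit': 1000000, 'skip': 0}

def fetch_view_rows(host):
    # Same design doc, view, and credentials as the curl command above.
    url = 'http://%s:8092/sasl_bucket_1/_design/dev_doc/_view/sasl1' % host
    resp = requests.get(url, params=VIEW_PARAMS,
                        auth=('Administrator', 'password'))
    resp.raise_for_status()
    return resp.json()['rows']

rows_c1 = fetch_view_rows('10.3.4.186')
rows_c2 = fetch_view_rows('10.3.4.188')
print('C1 rows: %d, C2 rows: %d, identical: %s'
      % (len(rows_c1), len(rows_c2), rows_c1 == rows_c2))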
Comment by Anil Kumar [ 17/Jul/14 ]
Test code issue.

Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Sangharsh Agarwal [ 21/Jul/14 ]
Re-running the test again to understand this problem, as it never appears with non-SSL XDCR but keeps appearing with the SSL XDCR failover case.
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
As this issue keeps coming up on recent builds, i.e. 973, I did some more investigation, and before making any changes to the test I would like to share some information on this bug, hoping it may help further investigation.

Aruna/Alk,
    Aruna, your previous observation (last comment) was correct. The updates are eventually replicated successfully, which is why you get the right information from the views. But the test never reads data from failed-over/rebalanced-out nodes, so it is difficult to argue that the test read data from the wrong server and got obsolete metadata:

[Points to Highlight]

The problem appears only with SSL XDCR; the same test always passes with non-SSL XDCR.

[Test Conditions]
The test is always reproducible if the failover side has 3 nodes and the other side has 4 nodes, i.e. after failover+rebalance there are 2 nodes left on that side. For example, I tried this test with a 4-node cluster and the test passed.
Analysis of the test found that updates are replicated to the other side very slowly, which causes this issue.

[Test Steps]
1. Have a 3-node Source cluster (S) and a 4-node Destination cluster (D).
2. Create two buckets, sasl_bucket_1 and sasl_bucket_2.
3. Set up SSL bi-directional XDCR (CAPI) for both buckets.
4. Load 10000 items into each bucket on the Source, keys prefixed "loadOne".
5. Load 10000 items into each bucket on the Source, keys prefixed "loadTwo".
6. Fail over and rebalance out one node on the Source cluster.
7. Update 3000 items and delete 3000 items on the Source, keys prefixed "loadOne".
8. Update 3000 items on the Destination, keys prefixed "loadTwo".
9. The test fails with a data mismatch between Source (S) and Destination (D): the "loadTwo" keys from the Destination (D, i.e. the non-failover side) had not been replicated by the time validation took place.


[Additional information]
1. The test passes with a smaller number of items/updates.
2. The test passes with a single bucket using the above-mentioned number of items/mutations.

[Workaround]
Add an additional 90-second sleep before verifying data, or increase the timeout for waiting for outbound mutations to reach zero from 3 minutes to 5 minutes; either ensures that all data is replicated in both directions of the bi-directional replication. But XDCR over UPR should be even faster than the previous XDCR, and this test always passed on 2.5.1.
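For reference, a minimal Python sketch of that wait, assuming a hypothetical outbound_mutations(node, bucket) callable that reads the replication_changes_left stat (names are illustrative, not the actual xdcrbasetests API):

import time

def wait_for_outbound_mutations_zero(node, bucket, outbound_mutations,
                                     timeout_secs=300, poll_secs=10):
    # outbound_mutations is a hypothetical callable (node, bucket) -> int
    # returning the current replication_changes_left; the default timeout is
    # the 5 minutes suggested above (up from 3).
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        remaining = outbound_mutations(node, bucket)
        print('Current outbound mutations on cluster node: %s for bucket %s is %s'
              % (node, bucket, remaining))
        if remaining == 0:
            return True
        time.sleep(poll_secs)
    print('Timeout occurred while waiting for mutations to be replicated')
    return False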
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Cluster is live for debugging:

[Source]
10.1.3.96
10.1.3.97
10.1.2.12

Node 10.1.3.97 was failed over.

[Destination]
10.1.3.93
10.1.3.94
10.1.3.95
10.1.3.99

[Test Logs]
https://s3.amazonaws.com/bugdb/jira/MB-11440/05bd3d96/test.log
Comment by Sangharsh Agarwal [ 22/Jul/14 ]
Aruna,
    If the analysis looks OK to you, please assign to Dev for further investigation.
Comment by Aruna Piravi [ 22/Jul/14 ]
Hi Sangharsh,

I see some important info in the "workaround" section. You are raising a valid point. So if you waited 90s more before verification and all items were correct, we are still replicating when we are expected to be done replicating. And if this did not occur in 2.5.1 on the same VMs (we could compare mutation replication rates), we probably have a performance regression with encrypted XDCR in 3.0.

cc'ing Pavel for his input.

Thanks,
Aruna
Comment by Pavel Paulau [ 23/Jul/14 ]
My input:

If you want to report this issue as a performance regression, then please make sure that all related functional bugs are resolved.
You also need a reliable (and ideally simple) way to show the slowness, e.g. a set of results for 2.5.1 and a set of results for 3.0.0, using a consistent and reasonable environment.

We had many similar reports before the 2.5.x releases. It's super critical to minimize the level of noise.
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
Logs are copied:

[Source]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/0ab776fe/10.1.3.96-7232014-345-diag.zip
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/cb30c533/10.1.3.96-8091-diag.txt.gz
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-11440/d0c15a08/10.1.3.96-7232014-332-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/90c3aa7c/10.1.2.12-8091-diag.txt.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/9eca3909/10.1.2.12-7232014-333-couch.tar.gz
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-11440/d4b275f5/10.1.2.12-7232014-351-diag.zip

10.1.3.97 (Failover Node): https://s3.amazonaws.com/bugdb/jira/MB-11440/9b5f28ad/10.1.3.97-7232014-344-diag.zip
10.1.3.97 (Failover Node) : https://s3.amazonaws.com/bugdb/jira/MB-11440/b0833ce4/10.1.3.97-7232014-332-couch.tar.gz
10.1.3.97 (Failover Node) : https://s3.amazonaws.com/bugdb/jira/MB-11440/e2e3f28f/10.1.3.97-8091-diag.txt.gz

[Destination]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/49b4dd7d/10.1.3.93-8091-diag.txt.gz
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/e82f1a33/10.1.3.93-7232014-332-diag.zip
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-11440/ef58cc41/10.1.3.93-7232014-331-couch.tar.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/46f07617/10.1.3.94-8091-diag.txt.gz
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/c4462a8e/10.1.3.94-7232014-340-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-11440/f478617e/10.1.3.94-7232014-332-couch.tar.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/33dfe115/10.1.3.95-7232014-336-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/8a616c90/10.1.3.95-8091-diag.txt.gz
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-11440/d91fe96f/10.1.3.95-7232014-331-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/816ae8c0/10.1.3.99-8091-diag.txt.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/ac259ae9/10.1.3.99-7232014-332-couch.tar.gz
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-11440/ef17b4ec/10.1.3.99-7232014-348-diag.zip

Comment by Sangharsh Agarwal [ 23/Jul/14 ]
Aruna, here are some log excerpts that could be useful for analyzing this issue:

[Test Logs]
https://s3.amazonaws.com/bugdb/jira/MB-11440/05bd3d96/test.log

[Failover Period]

2014-07-22 05:07:38 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 10.1.3.97:8091
2014-07-22 05:07:40 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@10.1.3.97 successful
2014-07-22 05:07:40 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
2014-07-22 05:07:40 | INFO | MainProcess | test_thread | [biXDCR.load_with_failover] Failing over Source Non-Master Node 10.1.3.97:8091
2014-07-22 05:07:41 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.97&user=Administrator&knownNodes=ns_1%4010.1.3.97%2Cns_1%4010.1.3.96%2Cns_1%4010.1.2.12
2014-07-22 05:07:41 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance operation started
2014-07-22 05:07:41 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 0 %
2014-07-22 05:07:51 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 18.4769405082 %
2014-07-22 05:08:01 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 46.11833859 %
2014-07-22 05:08:11 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 71.3362916906 %
2014-07-22 05:08:21 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 99.121522694 %
2014-07-22 05:08:32 | INFO | MainProcess | Cluster_Thread | [task.check] rebalancing was completed with progress: 100% in 50.1381921768 sec

According to the logs on 10.1.3.96, failover completed at 5:08:26.


I checked the xdcr finish time on the destination cluster:

Node        XDCR finish time (last timestamp in xdcr.log)
---------   ----------------------------------------------
10.1.3.93   5:11:26.898    On time as per data load
10.1.3.94   5:11:26.703    On time as per data load
10.1.3.95   5:13:58.970    Delayed
10.1.3.99   5:17:04.459    Replication finished last on this node


I can see many errors in the xdcr.log file on 10.1.3.99 showing checkpoint commit failures, as it was still retrying the commit on the failed-over node, i.e. 10.1.3.97.

[xdcr:debug,2014-07-22T5:10:03.573,ns_1@10.1.3.99:<0.3265.0>:xdc_vbucket_rep_ckpt:do_send_retriable_http_request:215]Got http error doing POST to https://Administrator:password@10.1.3.97:18092/_commit_for_checkpoint. Will retry. Error:
{{tls_alert,"unknown ca"},
 [{lhttpc_client,send_request,1,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,199}]},
  {lhttpc_client,execute,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,151}]},
  {lhttpc_client,request,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,83}]}]}

..
..

[xdcr:error,2014-07-22T5:16:55.896,ns_1@10.1.3.99:<0.4040.0>:xdc_vbucket_rep_ckpt:send_post:197]Checkpointing related POST to https://Administrator:password@10.1.3.97:18092/_commit_for_checkpoint failed:
{{tls_alert,"unknown ca"},
 [{lhttpc_client,send_request,1,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,199}]},
  {lhttpc_client,execute,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,151}]},
  {lhttpc_client,request,9,
   [{file,"/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/lhttpc/lhttpc_client.erl"},{line,83}]}]}


10.1.3.99 first received this error at 5:10:03, but it was still trying to commit checkpoints on the failed-over node even though the failover had completed well before that time.
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
The clusters are still live if you need them for investigation.
Comment by Aruna Piravi [ 23/Jul/14 ]
Sangharsh, 404 errors are OK and expected. They are seen once per vbucket: for every vbucket that receives a mutation, the source will try to reach the destination node that is now failed over, get this error, and only then retry with the new IP. Different vbuckets may receive mutations at different times, so there is no strict time window for 404 errors due to the changed destination node.

We don't need logs at this point. What may help is a comparison with and without SSL against 2.5.1 and 3.0 on the same VMs. We don't have the best hardware, but this is the little we can do to see if this is indeed a performance problem.

Thanks!
Comment by Sangharsh Agarwal [ 23/Jul/14 ]
If you look at the test, there were only 3000 updates on the cluster; do you think it should take 7 minutes for those mutations to be replicated? In addition, why is the node retrying communication with the failed-over node for the next 7 minutes, while the master node, i.e. 10.1.3.96, corrected the CAPI requests to the remaining nodes within 2 minutes?

I think that, along with the comparison between 2.5.1 and 3.0, please also compare the logs between non-SSL and SSL XDCR for the same test and the same build.
Comment by Aruna Piravi [ 25/Jul/14 ]
Alk,

Can you pls look at the logs @ https://s3.amazonaws.com/bugdb/jira/MB-11440/05bd3d96/test.log

Scenario-

1. 3 x 4 bi-XDCR (SSL); the 3-node cluster is C1, the 4-node cluster is C2.
2. Fail over one node from C1.
3. C2 takes longer (7 minutes) to send all mutations to C1. However, in the same test with non-SSL XDCR, C2 takes only 2 minutes to complete replication.

So the question here is why we are seeing this difference. Do you see anything notable on 10.1.3.99 that's causing the delay? Also, Sangharsh says the problem is specific to this setup (3 x 4).

 Node        XDCR finish time (last timestamp in xdcr.log)
 ---------   ----------------------------------------------
 10.1.3.93   5:11:26.898    On time as per data load
 10.1.3.94   5:11:26.703    On time as per data load
 10.1.3.95   5:13:58.970    Delayed
 10.1.3.99   5:17:04.459    Replication finished last on this node


Thanks,
Aruna
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
1) test.log doesn't tell me anything.

2) _please_ pleeeese do not re-purpose tickets like this. Originally this was a data-loss bug, and now suddenly you change it massively.

3) If you have reason to believe that 3.0 with SSL is much slower than 2.5.1 with SSL, I need evidence _and_ a fresh ticket.
Comment by Aruna Piravi [ 25/Jul/14 ]
Sangharsh,

1. Let's close this bug
2. Open another one comparing ssl vs non-ssl in 2.5.1 and 3.0 (on the same VMs) as discussed earlier, which would still not be "reliable enough" for performance benchmarking/testing, OR simply leave it for Pavel to test on his physical machines. He tests ssl vs non-ssl anyhow. Showfast does not show any regression between ssl and non-ssl.
3. Please just increase the timeout in the test for now to allow replication to run to completion.

Thanks




[MB-11731] Persistence to disk suffers from bucket compaction Created: 15/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_vs_write_queue.png     PNG File disk_write_queue.png     PNG File drain_rate.png    
Issue Links:
Duplicate
is duplicated by MB-11769 Major regression in write performance... Closed
Relates to
relates to MB-11732 Auto-compaction doesn't honor fragmen... Closed
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/357/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

This is not a new problem; we have been able to observe it for many months.

From the attached charts you can see that the drain rates (and, correspondingly, the disk write queues) of the two buckets are antiphased: every 30-40 minutes one of the buckets drains faster.

On average the size of the disk write queue doesn't differ from 2.5.x, but peak values are slightly higher.


 Comments   
Comment by Pavel Paulau [ 15/Jul/14 ]
Actually drain rate suffers from slower compaction, see also MB-11732.
Comment by Sundar Sridharan [ 15/Jul/14 ]
Artem, local testing reveals that spawning multiple database compactions in parallel makes it faster. Could you please explore whether we can somehow trigger multiple compactions in parallel, preferably on the same shard?
shardId = vbucketId % 4
thanks
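A conceptual sketch of that grouping in Python, with a hypothetical compact_vbucket() callable (the real compactor lives in ep-engine/C++; this only illustrates the shard-grouping and parallelism idea):

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # shardId = vbucketId % 4, as suggested above

def compact_bucket(vbucket_ids, compact_vbucket, parallelism=4):
    # compact_vbucket is a hypothetical callable taking a vbucket id.
    by_shard = defaultdict(list)
    for vb in vbucket_ids:
        by_shard[vb % NUM_SHARDS].append(vb)
    # Walk the shards one at a time, but compact each shard's vbuckets in
    # parallel rather than strictly one after another.
    for shard_id, vbs in sorted(by_shard.items()):
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            list(pool.map(compact_vbucket, vbs))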
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th
Comment by Pavel Paulau [ 21/Jul/14 ]
It's actually getting worse; see MB-11769 for details.

Also, I don't think that parallel compaction is the right solution for this particular issue.
We do need this feature, but we cannot use it as a fix.
Comment by Pavel Paulau [ 21/Jul/14 ]
Promoting to blocker and assigning back to ep-engine team due to MB-11769.
Comment by Sundar Sridharan [ 21/Jul/14 ]
Pavel, could you please clarify what you meant by "it's actually getting worse"? Did you mean a regression where a recent build (Jul 18th or later) shows a poorer drain rate than an earlier 3.0 build? thanks
Comment by Pavel Paulau [ 21/Jul/14 ]
Yes, when compaction starts, the disk write queue grows even higher in recent builds.

MB-11769 has more details about difference.
Comment by Sundar Sridharan [ 21/Jul/14 ]
Pavel, I see that you are using build 988, which contains the writer limit of 8 threads. As one might expect, limiting writers seems to have an impact on drain rate. To confirm this behavior, if possible, could you please experiment with a larger setting for max_num_writers using
cbepctl set flush_param max_num_writers 12 (or a different number) to see if this improves the drain rate (note that this may come at the expense of increased CPU usage)?
thanks
Comment by Pavel Paulau [ 21/Jul/14 ]
I will try that.

But the number of writers is still higher in 3.0; persistence being slower than in 2.5.1 doesn't make a lot of sense.
Comment by Sundar Sridharan [ 25/Jul/14 ]
fix uploaded for review at http://review.couchbase.org/39880 thanks
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue and MB-11799 (I think both issues share the same root cause):

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.




[MB-11799] Bucket compaction causes massive slowness of UPR consumers Created: 23/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Pavel Paulau Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Build 3.0.0-1005

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2680 v2 (40 vCPU)
Memory = 256 GB
Disk = RAID 10 SSD

Attachments: PNG File compaction_b1-vs-compaction_b2-vs-ep_upr_replica_items_remaining-vs_xdcr_lag.png    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/xdcr-5x5/386/artifact/
Is this a Regression?: Yes

 Description   
5 -> 5 UniDir, 2 buckets x 500M x 1KB, 10K SETs/sec, LAN

Similar to MB-11731, which keeps getting worse. But now compaction affects intra-cluster replication and XDCR latency as well:

"ep_upr_replica_items_remaining" reaches 1M during compaction
"xdcr latency" reaches 5 minutes during compaction.

See attached charts for details. Full reports:

http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c1_300-1005_a66_access
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=atlas_c2_300-1005_6d2_access

One important change that we made recently - http://review.couchbase.org/#/c/39647/.

The last known working build is 3.0.0-988.

 Comments   
Comment by Pavel Paulau [ 23/Jul/14 ]
Chiyoung,

This is a really critical regression. It affects many XDCR tests and also blocks many investigation/tuning efforts.
Comment by Sundar Sridharan [ 25/Jul/14 ]
fix added for review at http://review.couchbase.org/39880 thanks
Comment by Chiyoung Seo [ 25/Jul/14 ]
I made several fixes for this issue:

http://review.couchbase.org/#/c/39906/
http://review.couchbase.org/#/c/39907/
http://review.couchbase.org/#/c/39910/

We will provide the toy build for Pavel.




[MB-11825] Rebalance may fail if cluster_compat_mode:is_node_compatible times out waiting for ns_doctor:get_node Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.2.0, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: customer, rebalance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Is this a Regression?: No

 Description   
Saw this in CBSE-1301:

 <0.2025.3344> exited with {{function_clause,
                            [{new_ns_replicas_builder,handle_info,
                              [{#Ref<0.0.4447.107509>,
                                [stale,
                                 {last_heard,{1406,315410,842219}},
                                 {now,{1406,315410,210848}},
                                 {active_buckets,
                                  ["user_reg","sentence","telemetry","common",
                                   "notifications","playlists","users"]},
                                 {ready_buckets,

which caused rebalance to fail.

The reason is that new_ns_replicas_builder doesn't have the catch-all handle_info clause that's typical for gen_servers. This message occurs because of the following call chain:

* new_ns_replicas_builder:init/1

* ns_replicas_builder_utils:spawn_replica_builder/5

* ebucketmigrator_srv:build_args

* cluster_compat_mode:is_node_compatible

* ns_doctor:get_node

ns_doctor:get_node handles the timeout and returns an empty list. If this happens, the actual reply may be delivered later and end up being handled by handle_info, which in this case is unable to handle it.

3.0 is mostly immune to this particular chain of calls due to an optimization:

commit 70badff90b03176b357cac4d03e40acc62f4861b
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Tue Oct 1 11:44:02 2013 -0700

    MB-9096: optimized is_node_compatible when cluster is compatible
    
    There's no need to check for particular node's compatibility with
    certain feature if entire cluster's mode is new enough.
    
    Change-Id: I9573e6b2049cb00d2adad709ba41ec5285d66a6b
    Reviewed-on: http://review.couchbase.org/29317
    Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
    Reviewed-by: Artem Stemkovski <artem@couchbase.com>


 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
http://review.couchbase.org/39908




[MB-11675] 40-50% performance degradation on append-heavy workload compared to 2.5.1 Created: 09/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Dave Rigby Assignee: Pavel Paulau
Resolution: Unresolved Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: OS X Mavericks 10.9.3
CB server 3.0.0-918 (http://packages.northscale.com/latestbuilds/3.0.0/couchbase-server-enterprise_x86_64_3.0.0-918-rel.zip)
Haswell MacBook Pro (16GB RAM)

Attachments: PNG File CB 2.5.1 revAB_sim.png     PNG File CB 3.0.0-918 revAB_sim.png     JPEG File ep.251.jpg     JPEG File ep.300.jpg     JPEG File epso.251.jpg     JPEG File epso.300.jpg     Zip Archive MB-11675.trace.zip     Zip Archive perf_report_result.zip     Zip Archive revAB_sim_v2.zip     Zip Archive revAB_sim.zip    
Issue Links:
Relates to
relates to MB-11623 test for performance regressions with... In Progress

 Description   
When running an append-heavy workload (modelling a social network address book, see below), the performance of CB has dropped from ~100K op/s down to ~50K op/s compared to 2.5.1-1083 on OS X.

Edit: I see a similar (but slightly smaller, around 40%) degradation on Linux (Ubuntu 14.04) - see comment below for details.

== Workload ==

revAB_sim - generates a model social network, then builds a representation of this in Couchbase. Keys are a set of phone numbers, values are lists of phone books which contain that phone number. (See attachment).

Configured for 8 client threads, 100,000 people (documents).

To run:

* pip install networkx
* Check revAB_sim.py for correct host, port, etc
* time ./revAB_sim.py
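For readers without the attachment, here is a heavily simplified sketch of the kind of append-heavy access pattern described above; it assumes the 2.x Python SDK (couchbase.bucket.Bucket) and networkx, and it is not the code of the attached revAB_sim.py, which remains the authoritative workload.
{code}
# Sketch of an append-heavy "reverse address book" load, loosely modelled on
# the description above. Not the attached revAB_sim.py.
import networkx as nx
from couchbase import FMT_UTF8
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError

cb = Bucket('couchbase://localhost/default')

# Model a social network; each node stands for a person / phone number.
graph = nx.barabasi_albert_graph(100000, 10)

for person, contact in graph.edges():
    key = 'phone::%d' % contact
    try:
        # Reverse index: append this person to the list of phone books that
        # contain the contact's number -- this is the append-heavy part.
        cb.append(key, ';%d' % person, format=FMT_UTF8)
    except CouchbaseError:
        # First time we see this key: create it.
        cb.upsert(key, '%d' % person, format=FMT_UTF8)
{code}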

== Cluster ==

1 node, default bucket set to 1024MB quota.

== Runtimes for workload to complete ==


## CB-2.5.1-1083:

~107K op/s. Timings for workload (3 samples):

real 2m28.536s
real 2m28.820s
real 2m31.586s


## CB-3.0.0-918

~54K op/s. Timings for workload:

real 5m23.728s
real 5m22.129s
real 5m24.947s


 Comments   
Comment by Pavel Paulau [ 09/Jul/14 ]
I'm just curious, what does consume all CPU resources?
Comment by Dave Rigby [ 09/Jul/14 ]
I haven't had a chance to profile it yet; certainly in both instances (fast / slow) the CPU is at 100% across the client workload and the server.
Comment by Pavel Paulau [ 09/Jul/14 ]
Is memcached top consumer? or beam.smp? or client?
Comment by Dave Rigby [ 09/Jul/14 ]
memcached highest (as expected). From the 3.0.0 package (which I still have installed):

PID COMMAND %CPU TIME #TH #WQ #PORT #MREG MEM RPRVT PURG CMPRS VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW
34046 memcached 476.9 01:34.84 17/7 0 36 419 278M+ 277M+ 0B 0B 348M 2742M 34046 33801 running 501 73397+ 160 67 26 13304643+ 879+ 4070244+
34326 Python 93.4 00:18.57 9/1 0 25 418 293M+ 293M+ 0B 0B 386M 2755M 34326 1366 running 501 77745+ 399 70 28 15441263+ 629 5754198+
0 kernel_task 71.8 00:14.29 95/9 0 2 949 1174M+ 30M 0B 0B 295M 15G 0 0 running 0 42409 0 57335763+ 52435352+ 0 0 278127194+
...
32800 beam.smp 8.5 00:05.61 30/4 0 49 330 155M- 152M- 0B 0B 345M- 2748M- 32800 32793 running 501 255057+ 468 149 30 6824071+ 1862753+ 1623911+


Python is the workload generator.

I shall try to collect an Instruments profile of 3.0 and 2.5.1 to compare...
Comment by Dave Rigby [ 09/Jul/14 ]
Instruments profile of two runs:

Run 1: 3.0.0 (slow)
Run 2: 2.5.1 (fast)

I can look into the differences tomorrow if no-one else gets there first.


Comment by Dave Rigby [ 10/Jul/14 ]
Running on Linux (Ubuntu 14.04) on a 24-core Xeon, I see a similar effect, but the magnitude is not as bad - a 40% performance drop.

100,000 documents with 4 worker threads, same bucket size (1024MB). (Note: worker threads were dropped to 4 as I couldn't get the Python SDK to reliably connect with 8 threads at the same time.)

## CB-3.0.0 (source build):

    83k op/s
    real 3m26.785s

## CB-2.5.1 (source build):

    133K op/s
    real 2m4.276s


Edit: Attached updated zip file as: revAB_sim_v2.zip
Comment by Dave Rigby [ 10/Jul/14 ]
Attaching the output of `perf report` for both 2.5.1 and 3.0.0 - perf_report_result.zip

There's nothing obvious jumping out at me; it looks like quite a bit has changed between the two in ep_engine.
Comment by Dave Rigby [ 11/Jul/14 ]
I'm tempted to bump this to "blocker" considering it also affects Linux - any thoughts?
Comment by Pavel Paulau [ 11/Jul/14 ]
It's a product/release blocker, no doubt.

(though raising priority at this point will not move ticket to the top of the backlog due to other issues)
Comment by Dave Rigby [ 11/Jul/14 ]
@Pavel done :)
Comment by Abhinav Dangeti [ 11/Jul/14 ]
I think I should bring to people's notice that in 3.0 JSON detection has been moved to before items are set in memory. This could very well be the cause of this regression (previously we did do this JSON check, but only just before persistence).
This was part of the datatype-related change now required by UPR.
A HELLO protocol command was introduced in 3.0 which clients can invoke, thereby letting the server know that they will set the datatype themselves, in which case this JSON check wouldn't take place.
If a client doesn't invoke the HELLO command, then we do JSON detection to set the datatype correctly.

However, HELLO was recently disabled because we weren't ready to handle compressed documents in the view engine. This means we now do a mandatory JSON check for every store operation, before the document is even set in memory.
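Purely as an illustration of the overhead being described (ep-engine does this detection in C++; this is not its code and proves nothing about the actual regression), a mandatory per-store datatype check amounts to parsing every incoming value once before it is accepted:
{code}
# Illustration only: the principle of a per-store JSON/datatype check.
import json
import time

values = ['{"phone_book": [%d]}' % i for i in range(100000)]

def looks_like_json(value):
    try:
        json.loads(value)
        return True
    except ValueError:
        return False

start = time.time()
detected = sum(1 for v in values if looks_like_json(v))
print('JSON-detected %d of %d values in %.2fs' %
      (detected, len(values), time.time() - start))
{code}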
Comment by Cihan Biyikoglu [ 11/Jul/14 ]
Thanks Abhinav. Can we quickly try this out to see if it resolves the issue and, if that's proven, revert this change?
Comment by David Liao [ 14/Jul/14 ]
I tried testing using the provided scripts with and without the json checking logic and there is no difference (on Mac and Ubuntu).

The total size of the data is less than 200 MB with 100K items; at <2 KB per item that is not very big.
Comment by David Liao [ 15/Jul/14 ]
There might be an issue with general disk operations. I tested sets and they show the same performance difference as appends.
Pavel, have you seen any 'set' performance drop with 3.0? There is no rebalance involved, just a single node in this test.
Comment by Pavel Paulau [ 16/Jul/14 ]
3.0 performs worse in CPU-bound scenarios.
However, Dave observed the same issue on a system with 24 vCPUs, which is kind of confusing to me.
Comment by Pavel Paulau [ 16/Jul/14 ]
Meanwhile I tried that script in my environment. I see no difference between 2.5.1 and 3.0.

3.0.0-969: real 3m30.530s
2.5.1-1083: real 3m28.911s

Peak throughput is about 80K in both cases.

h/w configuration:

Platform = Physical
OS = CentOS 6.5
CPU = Intel Xeon E5-2630 (24 vCPU)
Memory = 64 GB
Disk = RAID 10 HDD

I used a standalone server as test client and regular packages.
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: I was essentially maxing out the system, so that probably explains why even with 24 cores I could see the issue.
Comment by Pavel Paulau [ 16/Jul/14 ]
Does that mean 83/133K ops/sec saturates a system with 24 cores?
Comment by Dave Rigby [ 16/Jul/14 ]
@Pavel: yes (including the client workload generator which was running on the same machine). I could possibly push it higher by increasing the client worker threads, but as mentioned I had some python SDK connection issues then.

Comment by Pavel Paulau [ 16/Jul/14 ]
Weird, in my case CPU utilization was less than 500% (IRIX mode).
Comment by David Liao [ 16/Jul/14 ]
I am using a 4-core/4 GB ubuntu VM for the test.

3.0
real 11m16.530s
user 2m33.814s
sys 2m35.779s
<30k ops

2.5.1
real 7m6.843s
user 2m6.693s
sys 2m2.696s
40k ops


During today's test, I found that the disk queue fill/drain rate of 3.0.0 is much smaller than that of 2.5.1 (<2k vs 30k). The CPU usage is ~8% higher too, but most of the increase is from system CPU usage (total CPU is almost maxed out on 3.0).

Pavel, can you check the disk queue fill/drain rate of your test and system vs user cpu usage?
Comment by Pavel Paulau [ 16/Jul/14 ]
David,

I will check disk stats tomorrow. In the meantime, I would recommend you run the benchmark with persistence disabled.
Comment by Pavel Paulau [ 18/Jul/14 ]
In my case the drain rate is higher in 2.5.1 (80K vs. 5K), but the size of the write queue and the rate of actual disk creates/updates are pretty much the same.

CPU utilization is 2x higher in 3.0 (400% vs. 200%).

However I don't understand how this information helps.
Comment by David Liao [ 21/Jul/14 ]
The drain rate may not be accurate on 2.5.1.

'iostat' shows about 2x the 'tps' and 'KB_wrtn/s' for 3.0.0 vs 2.5.1, which indicates far more disk activity in 3.0.0.

We need to find out what the extra disk activity is. Since ep-engine issues "set"s to couchstore, which then writes to disk, we should benchmark couchstore separately to isolate the problem.

Pavel, is there a way to do a couchstore performance test?
Comment by Pavel Paulau [ 22/Jul/14 ]
Due to the increased number of flusher threads, 3.0.0 persists data faster; that must explain the higher disk activity.

Once again, disabling disk persistence entirely will eliminate the "disk" factor (just as an experiment).

Also, I don't think we made any critical changes in couchstore, so I don't expect any regression there. Chiyoung may have some benchmarks.
Comment by David Liao [ 22/Jul/14 ]
I have played with different flusher thread counts but don't see any improvement in my own not-designed-for-serious-performance-testing environment.

Logically, if the flusher threads run faster, the total transfer to disk should finish in a shorter time. My observation is that the higher TPS lasted for the entire test period, which itself is much longer than on 2.5.1, meaning the total disk activity (TPS and data written to disk) for the same amount of workload is much higher.

Do you mean using a memcached bucket when you say "disabling disk"? That test shows much less performance degradation, which means the majority of the problem is not in the memcached layer.

I am not familiar with the couchstore changes, but there are indeed quite a lot of them and I'm not sure who is responsible for that component. Still, it needs to be tested just like any other component.
Comment by Pavel Paulau [ 23/Jul/14 ]
I meant disabling persistence to disk in couchbase bucket. E.g., using cbepctl.
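For reference, a sketch of what that looks like, assuming the default Linux install path and data port; cbepctl's stop/start sub-commands pause and resume persistence on one node.
{code}
# Sketch: pause persistence on a node for the no-disk experiment, then resume.
# Host, path and port are illustrative defaults.
import subprocess

node = 'localhost:11210'
subprocess.check_call(['/opt/couchbase/bin/cbepctl', node, 'stop'])
# ... run the workload ...
subprocess.check_call(['/opt/couchbase/bin/cbepctl', node, 'start'])
{code}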
Comment by David Liao [ 23/Jul/14 ]
I disabled persistence with cbepctl and reran the tests and got the same performance degradation:

3.0.0:
real 6m3.988s
user 1m59.670s
sys 2m1.363s
ops: 50k

2.5.1
real 4m18.072s
user 1m45.940s
sys 1m39.775s
ops: 70k

So it's not the disk-related operations that caused this.
Comment by David Liao [ 24/Jul/14 ]
Dave, what profiling tool did you use to collect the profiling data you attached?
Comment by Dave Rigby [ 24/Jul/14 ]
I used Linux perf - see for example http://www.brendangregg.com/perf.html
Comment by David Liao [ 25/Jul/14 ]
Attached the perf report for ep.so on 3.0.0.
Comment by David Liao [ 25/Jul/14 ]
Attached the perf report for ep.so on 2.5.1.
Comment by David Liao [ 25/Jul/14 ]
I attached memcached and ep.so CPU usage for both 3.0.0 and 2.5.1.

2.5.1 didn't use C++ atomics. I tested 3.0.0 without C++ atomics and saw the following improvement: ~20% difference.

Both with persistence disabled.

2.5.1
real 7m38.581s
user 2m11.771s
sys 2m27.968s
ops: 35k+

3.0.0
real 9m15.638s
user 2m31.642s
sys 2m56.154s
ops: ~30k

There could be multiple things that we still need to look at: the threading change in 3.0.0 (and thus figuring out the best number of threads for different workloads), and also why much more data is being written to disk in this workload.

I am using my laptop for the perf testing, but this kind of test should be done in a dedicated/controlled testing environment.
So the perf team should try testing the following areas:
1. the C++ atomics change.
2. different threading configurations for different types of workload.
3. independent couchstore testing decoupled from ep-engine.





[MB-11821] Rename UPR to DCP in stats and loggings Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sundar Sridharan Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
ep-engine side changes are http://review.couchbase.org/#/c/39898/ thanks




[MB-11805] KV+ XDCR System test: Missing items in bi-xdcr only Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-998

Clusters
-----------
C1 : http://172.23.105.44:8091/
C2 : http://172.23.105.54:8091/
Free for investigation. Not attaching data files.

Steps
--------
1a. Load on both clusters till vb_active_resident_items_ratio < 50.
1b. Setup bi-xdcr on "standardbucket", uni-xdcr on "standardbucket1"
2. Access phase with 50% gets, 50% deletes for 3 hrs
3. Rebalance-out 1 node at cluster1
4. Rebalance-in 1 node at cluster1
5. Failover and remove node at cluster1
6. Failover and add-back node at cluster1
7. Rebalance-out 1 node at cluster2
8. Rebalance-in 1 node at cluster2
9. Failover and remove node at cluster2
10. Failover and add-back node at cluster2
11. Soft restart all nodes in cluster1 one by one
Verify item count

Problem
-------------
standardbucket(C1) <---> standardbucket(C2)
On C1 - 57890744 items
On C2 - 57957032 items
standardbucket1(C1) ----> standardbucket1(C2)
On C1 - 14053020 items
On C2 - 14053020 items

Total number of missing items : 66,288

Bucket priority
-----------------------
Both standardbucket and standardbucket1 have high priority.


Attached
-------------
cbcollect and list of keys that are missing on vb0


Missing keys
-------------------
At least 50-60 keys are missing in every vbucket. Attaching all missing keys from vb0.

vb0
-------
{'C1_node:': u'172.23.105.44',
'vb': 0,
'C2_node': u'172.23.105.54',
'C1_key_count': 78831,
 'C2_key_count': 78929,
 'missing_keys': 98}

     id: 06FA8A8B-11_110 deleted, tombstone exists
     id: 06FA8A8B-11_1354 present, report a bug!
     id: 06FA8A8B-11_1426 present, report a bug!
     id: 06FA8A8B-11_2175 present, report a bug!
     id: 06FA8A8B-11_2607 present, report a bug!
     id: 06FA8A8B-11_2797 present, report a bug!
     id: 06FA8A8B-11_3871 deleted, tombstone exists
     id: 06FA8A8B-11_4245 deleted, tombstone exists
     id: 06FA8A8B-11_4537 present, report a bug!
     id: 06FA8A8B-11_662 deleted, tombstone exists
     id: 06FA8A8B-11_6960 present, report a bug!
     id: 06FA8A8B-11_7064 present, report a bug!
     id: 3600C830-80_1298 present, report a bug!
     id: 3600C830-80_1308 present, report a bug!
     id: 3600C830-80_2129 present, report a bug!
     id: 3600C830-80_4219 deleted, tombstone exists
     id: 3600C830-80_4389 deleted, tombstone exists
     id: 3600C830-80_7038 present, report a bug!
     id: 3FEF1B93-91_2890 present, report a bug!
     id: 3FEF1B93-91_2900 present, report a bug!
     id: 3FEF1B93-91_3004 present, report a bug!
     id: 3FEF1B93-91_3194 present, report a bug!
     id: 3FEF1B93-91_3776 deleted, tombstone exists
     id: 3FEF1B93-91_753 present, report a bug!
     id: 52D6D916-120_1837 present, report a bug!
     id: 52D6D916-120_3282 present, report a bug!
     id: 52D6D916-120_3312 present, report a bug!
     id: 52D6D916-120_3460 present, report a bug!
     id: 52D6D916-120_376 deleted, tombstone exists
     id: 52D6D916-120_404 deleted, tombstone exists
     id: 52D6D916-120_4926 present, report a bug!
     id: 52D6D916-120_5022 present, report a bug!
     id: 52D6D916-120_5750 present, report a bug!
     id: 52D6D916-120_594 deleted, tombstone exists
     id: 52D6D916-120_6203 present, report a bug!
     id: 5C12B75A-142_2889 present, report a bug!
     id: 5C12B75A-142_2919 present, report a bug!
     id: 5C12B75A-142_569 deleted, tombstone exists
     id: 73C89FDB-102_1013 present, report a bug!
     id: 73C89FDB-102_1183 present, report a bug!
     id: 73C89FDB-102_1761 present, report a bug!
     id: 73C89FDB-102_2232 present, report a bug!
     id: 73C89FDB-102_2540 present, report a bug!
     id: 73C89FDB-102_4092 deleted, tombstone exists
     id: 73C89FDB-102_4102 deleted, tombstone exists
     id: 73C89FDB-102_668 deleted, tombstone exists
     id: 87B03DB1-62_3369 present, report a bug!
     id: 8DA39D2B-131_1949 present, report a bug!
     id: 8DA39D2B-131_725 deleted, tombstone exists
     id: A2CC835C-00_2926 present, report a bug!
     id: A2CC835C-00_3022 present, report a bug!
     id: A2CC835C-00_3750 present, report a bug!
     id: A2CC835C-00_5282 present, report a bug!
     id: A2CC835C-00_5312 present, report a bug!
     id: A2CC835C-00_5460 present, report a bug!
     id: A2CC835C-00_6133 present, report a bug!
     id: A2CC835C-00_6641 present, report a bug!
     id: A5C9F867-33_1091 present, report a bug!
     id: A5C9F867-33_1101 present, report a bug!
     id: A5C9F867-33_1673 present, report a bug!
     id: A5C9F867-33_2320 present, report a bug!
     id: A5C9F867-33_2452 present, report a bug!
     id: A5C9F867-33_4010 deleted, tombstone exists
     id: A5C9F867-33_4180 deleted, tombstone exists
     id: CD7B0436-153_3638 present, report a bug!
     id: CD7B0436-153_828 present, report a bug!
     id: D94DA3B2-51_829 present, report a bug!
     id: DE161E9D-40_1235 present, report a bug!
     id: DE161E9D-40_1547 present, report a bug!
     id: DE161E9D-40_2014 present, report a bug!
     id: DE161E9D-40_2184 present, report a bug!
     id: DE161E9D-40_2766 present, report a bug!
     id: DE161E9D-40_3880 deleted, tombstone exists
     id: DE161E9D-40_3910 deleted, tombstone exists
     id: DE161E9D-40_4324 deleted, tombstone exists
     id: DE161E9D-40_4456 deleted, tombstone exists
     id: DE161E9D-40_6801 present, report a bug!
     id: DE161E9D-40_6991 present, report a bug!
     id: DE161E9D-40_7095 present, report a bug!
     id: DE161E9D-40_7105 present, report a bug!
     id: DE161E9D-40_940 present, report a bug!
     id: E9F46ECC-22_173 deleted, tombstone exists
     id: E9F46ECC-22_2883 present, report a bug!
     id: E9F46ECC-22_2913 present, report a bug!
     id: E9F46ECC-22_3017 present, report a bug!
     id: E9F46ECC-22_3187 present, report a bug!
     id: E9F46ECC-22_3765 deleted, tombstone exists
     id: E9F46ECC-22_5327 present, report a bug!
     id: E9F46ECC-22_5455 present, report a bug!
     id: E9F46ECC-22_601 deleted, tombstone exists
     id: E9F46ECC-22_6096 present, report a bug!
     id: E9F46ECC-22_6106 present, report a bug!
     id: E9F46ECC-22_6674 present, report a bug!
     id: E9F46ECC-22_791 present, report a bug!
     id: ECD6BE16-113_2961 present, report a bug!
     id: ECD6BE16-113_3065 present, report a bug!
     id: ECD6BE16-113_3687 present, report a bug!
     id: ECD6BE16-113_3717 present, report a bug!

74 undeleted key(s) present on C2(.54) compared to C1(.44)
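For reference, the per-vbucket summary above can be reproduced with a small script along these lines; the file names are hypothetical and it assumes the keys for the vbucket have already been dumped from each cluster, one key per line.
{code}
# Sketch: diff the key sets of one vbucket across the two clusters and emit a
# summary like the vb0 block above. Input files are hypothetical dumps.
def load_keys(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

c1_keys = load_keys('c1_vb0_keys.txt')   # dumped from C1 (.44)
c2_keys = load_keys('c2_vb0_keys.txt')   # dumped from C2 (.54)

missing_on_c1 = sorted(c2_keys - c1_keys)

print({'C1_node': '172.23.105.44',
       'C2_node': '172.23.105.54',
       'vb': 0,
       'C1_key_count': len(c1_keys),
       'C2_key_count': len(c2_keys),
       'missing_keys': len(missing_on_c1)})

for key in missing_on_c1:
    print('     id: %s' % key)
{code}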











 Comments   
Comment by Aruna Piravi [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11805/C1.tar
https://s3.amazonaws.com/bugdb/jira/MB-11805/C2.tar
Comment by Aruna Piravi [ 25/Jul/14 ]
[7/23/14 1:40:12 PM] Aruna Piraviperumal: hi Mike, I see some backfill stmts like in MB-11725 but that doesn't lead to any missing items
[7/23/14 1:40:13 PM] Aruna Piraviperumal: 172.23.105.47
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.122.0>: Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:11:57.833959 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-e604cd19b3a376ccea68ed47556bd3d4 - (vb 271) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.1.txt:Tue Jul 22 16:12:35.180434 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-91ddbb7062107636d3c0556296eaa879 - (vb 379) Sending disk snapshot with start seqno 0 and end seqno 0
 
172.23.105.50


172.23.105.59


172.23.105.62


172.23.105.45
/opt/couchbase/var/lib/couchbase/logs/memcached.log.27.txt:Tue Jul 22 16:02:46.470085 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-2ad6ab49733cf45595de9ee568c05798 - (vb 421) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.48


172.23.105.52
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/babysitter.log:memcached<0.78.0>: Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:17.533338 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-d2c9937085d4c3f5b65979e7c1e9c3bb - (vb 974) Sending disk snapshot with start seqno 0 and end seqno 0
/opt/couchbase/var/lib/couchbase/logs/memcached.log.0.txt:Tue Jul 22 16:38:21.446553 PDT 3: (standardbucket1) UPR (Producer) eq_uprq:xdcr:standardbucket1-a3a462133cf1934c4bf47259331bf8a7 - (vb 958) Sending disk snapshot with start seqno 0 and end seqno 0

172.23.105.44
[7/23/14 1:56:12 PM] Michael Wiederhold: Having one of those isn't necessarily bad. Let me take a quick look
[7/23/14 2:02:49 PM] Michael Wiederhold: Ok this is good. I'll debug it a little bit more. Also, I don't necessarily expect that data loss will always occur because it's possible that the items could have already been replicated.
[7/23/14 2:03:38 PM] Aruna Piraviperumal: ok
[7/23/14 2:03:50 PM] Aruna Piraviperumal: I'm noticing data loss on standard bucket though
[7/23/14 2:04:19 PM] Aruna Piraviperumal: but no such disk snapshot logs found for 'standardbucket'
Comment by Mike Wiederhold [ 25/Jul/14 ]
For vbucket 0 in the logs I see that on the source side we have high seqno 102957, but on the destination we only have up to seqno 97705 so it appears that some items were not sent to the remote side. I also see in the logs that xdcr did request those items as shown in the log messages below.

memcached<0.78.0>: Wed Jul 23 12:30:02.506513 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 95291 and end seqno 0
memcached<0.78.0>: Wed Jul 23 13:30:01.683760 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) stream created with start seqno 95291 and end seqno 102957
memcached<0.78.0>: Wed Jul 23 13:30:02.070134 PDT 3: (standardbucket) UPR (Producer) eq_uprq:xdcr:standardbucket-9286724e8dbd0dfbe6f9308d093ede5e - (vb 0) Stream closing, 0 items sent from disk, 7666 items sent from memory, 102957 was last seqno sent
[ns_server:info,2014-07-23T13:30:10.753,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Wed Jul 23 13:30:10.552586 PDT 3: (standardbucket) UPR (Notifier) eq_uprq:xdcr:notifier:ns_1@172.23.105.44:standardbucket - (vb 0) stream created with start seqno 102957 and end seqno 0
Comment by Mike Wiederhold [ 25/Jul/14 ]
Alk,

See my comments above. Can you verify that all items were sent by the xdcr module correctly?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Let me quickly note that .tar is again in fact .tar.gz.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
missing:

a) data files (so that I can double-check your finding)

b) xdcr traces
Comment by Aruna Piravi [ 25/Jul/14 ]
1. For system tests the data files are huge, so I did not attach them; the cluster is available.
2. xdcr traces were not enabled for this run, my apologies - but do we discard all the info we have in hand? Another complete run will take 3 days. I'm not sure we want to delay the investigation for that long.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
There's no way to investigate such delicate issue without having at least traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
If all files are large you can at least attach that vbucket 0 where you found discrepancies.
Comment by Aruna Piravi [ 25/Jul/14 ]
> There's no way to investigate such delicate issue without having at least traces.
If it is that important, it probably makes sense to enable traces by default rather than having to do a diag/eval? Customer logs are not going to have traces by default.

>If all files are large you can at least attach that vbucket 0 where you found discrepancies.
 I can, if requested. The cluster was anyway left available.

Fine, let me do another run if there's no way to work around not having traces.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
>> > There's no way to investigate such delicate issue without having at least traces.

>> If it is that important, it probably makes sense to enable traces by default than having to do diag/eval? Customer logs are not going to have traces by default.

Not possible. We log potentially critical information. But _your_ tests are all semi-automated right? So for your automation it makes sense indeed to always enable xdcr tracing.
Comment by Aruna Piravi [ 25/Jul/14 ]
System test is completely automated. Only the post-test verification is not. But enabling tracing is now a part of the framework.




[MB-11784] GUI incorrectly displays vBucket number in stats Created: 22/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Ian McCloy Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: customer, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 251VbucketDisplay.png     PNG File 3fixVbucketDisplay.png    
Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Many customers are confused and have complained that on the "General Bucket Analytics" / "VBUCKET RESOURCES" page, when listing the number of vBuckets, the GUI tries to convert the value of 1024 default vBuckets to kilobytes, so it displays as 1.02k vBuckets (screenshot attached). vBucket counts shouldn't be parsed and should always show the full number.

I've changed the JavaScript to detect vBucket values and not parse them (screenshot attached). Will amend with a gerrit link when it's pushed to review.
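For illustration only (not the actual UI code), truncating stats to three significant digits is what turns the exact count 1024 into "1.02K":
{code}
# Illustration: a generic three-significant-digit "kilo" formatter loses the
# exact vBucket count, which is why 1024 shows up as 1.02K.
def format_stat(value):
    if value >= 1000:
        return '%.2fK' % (value / 1000.0)
    return str(value)

print(format_stat(1024))  # -> "1.02K": the exact count is lost
print(1024)               # what the fix displays for vBucket counts
{code}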

 Comments   
Comment by Ian McCloy [ 22/Jul/14 ]
Code added to gerrit for review -> http://review.couchbase.org/#/c/39668/
Comment by Pavel Blagodov [ 24/Jul/14 ]
Hi Ian, here is clarification:
- kilo (or 'K') is a unit prefix in the metric system denoting multiplication by one thousand.
- kilobyte (or 'KB') is a multiple of the unit byte for digital information.
Comment by Ian McCloy [ 24/Jul/14 ]
Pavel, thank you for clearing that up for me. Can you please explain: when I see 1.02K vBuckets in the stats, is that 1022, 1023 or 1024 active vBuckets? I'm not clear when I look at the UI.
Comment by Pavel Blagodov [ 25/Jul/14 ]
1.02K is the expected value because the UI currently truncates all analytics stats to three digits. Of course we could increase this to four digits, but that would only work for K (not for M, for example).
Comment by David Haikney [ 25/Jul/14 ]
@Pavel - Yes 1.02k is currently expected but the desire here is to change the UI to show "1024" instead of "1.02K". Fewer characters and more accuracy.




[MB-9222] standalone moxi-server -- no core on RHEL5 Created: 04/Oct/13  Updated: 25/Jul/14  Due: 20/Jun/14

Status: Open
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 1.8.1, 2.1.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Alexander Petrossian (PAF) Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: moxi
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
moxi init.d script contains
{code}
ulimit -c unlimited
{code}
line, which is supposed to allow core-dumps.

But then it uses the OS's /etc/.../functions "daemon" function,
which overrides this ulimit.

One needs to use
{code}
DAEMON_COREFILE_LIMIT=unlimited
{code}

environment variable, which will be handled by the "daemon" function to do "ulimit -c unlimited".

 Comments   
Comment by Alexander Petrossian (PAF) [ 04/Oct/13 ]
Once we did that, we found out that moxi does chdir("/").
We found in the sources that one can use the "-r" command line switch to prevent the "cd /" from happening.
Also, the "/var/run" folder, which is chdir'ed to before the "daemon" command, is no good anyway: it cannot be written to by the "moxi" user.

I feel that being able to write cores is very important.
I agree that it may not be a good idea to enable this by default.

But right now this is broken in 3 places, which is not good.

We suggest:
# cd /tmp (instead of cd /var/run) -- usually a safe place for any user to write to, and it exists on all systems.
# document the -r command line switch (currently not documented in "moxi -h")
# add DAEMON_COREFILE_LIMIT before calling the "daemon" function
Comment by Alexander Petrossian (PAF) [ 04/Oct/13 ]
regarding the default... we see there is core (and .dump) here:
[root@spms-lbas ~]# ls -la /opt/membase/var/lib/membase
-rw------- 1 membase membase 851615744 Feb 19 2013 core.12674
-rw-r----- 1 membase membase 12285899 Oct 4 17:45 erl_crash.dump

so maybe it is a good idea to enable it by default?

[root@spms-lbas ~]# file /opt/membase/var/lib/membase/core.12674
/opt/membase/var/lib/membase/core.12674: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'memcached'
[root@spms-lbas ~]#
Comment by Matt Ingenthron [ 20/Dec/13 ]
Steve: who is the right person to look at this these days?
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Iryna,

can you confirm if this is still happening in 3.0?
If it is, pls assign to Steve Y. Otherwise, resolve and close.
Thanks.
Comment by Steve Yen [ 25/Jul/14 ]
(scrubbing through ancient moxi bugs)

Hi Chris,
Looks like Alexander Petrossian has found the issue and the fix with the DAEMON_COREFILE_LIMIT env variable.

Can you incorporate this into moxi-init.d ?

Thanks,
Steve
Comment by Chris Hillery [ 25/Jul/14 ]
For prioritization purposes: Are we actually producing a standalone moxi product anymore? I'm unaware of any builds for it, so does it make sense to tag this bug "3.0" or indeed fix it at all?
Comment by Steve Yen [ 25/Jul/14 ]
Hi Chris,
We are indeed still supposed to provide a standalone moxi build (unless I'm out of date on news).

Priority-wise, IMHO, it's not the highest (but that's just my opinion), as I believe folks can still get by with the standalone moxi from 2.5.1. That is, moxi hasn't changed that much functionally -- although Trond & team did a bunch of rewriting / refactoring to make it easier to build and develop (cmake, etc).

Cheers,
Steve




[MB-11786] {UPR}:: Rebalance-out hangs due to indexing stuck Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Seeing this issue in 991

1. Create 7 node cluster (10.6.2.144-150)
2. Create default Bucket
3. Add 1K items
4. Create 5 views and query
5. Rebalance out node 10.6.2.150

Steps 4 and 5 are run in parallel.

We see the rebalance hanging.

I am seeing the following issue in the couchdb log on 10.6.2.150:

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}

[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]View merger, revision mismatch for design document `_design/ddoc1', wanted 5-3275804e, got 5-3275804e
[couchdb:error,2014-07-22T13:37:52.699,ns_1@10.6.2.150:<0.217.0>:couch_log:error:44]Uncaught error in HTTP request: {throw,{error,revision_mismatch}}

Stacktrace: [{couch_index_merger,query_index,3,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couch_index_merger/src/couch_index_merger.erl"},
                  {line,75}]},
             {couch_httpd,handle_request,6,
                 [{file,
                      "/buildbot/build_slave/centos-5-x64-300-builder/build/build/couchdb/src/couchdb/couch_httpd.erl"},
                  {line,222}]},
             {mochiweb_http,headers,5,


Will attach logs ASAP

Test Case:: ./testrunner -i ubuntu_x64--109_00--Rebalance-Out.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,dgm=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_with_queries,nodes_out=1,blob_generator=False,value_size=1024,GROUP=OUT;BASIC;P0;FROM_2_0

 Comments   
Comment by Parag Agarwal [ 22/Jul/14 ]
The cluster is live if you want to investigate 10.6.2.144-150.
Comment by Parag Agarwal [ 22/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11786/991_logs.tar.gz
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
We're waiting for the index to become updated.

I.e. I see a number of these:

     {<17674.13818.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007f64917effa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f6493d4f070 Return addr 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.9.246202>">>,<<"y(1) infinity">>,
                   <<"y(2) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.11899.5>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007f6493d4f0a8 Return addr 0x00007f6444879940 (janitor_agent:wait_index_updated/5 + 432)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {if_rebalance,<0.12798.5>,{wait_index_updated,112}}">>,
                   <<"y(2) {'janitor_agent-default','ns_1@10.6.2.144'}">>,
                   <<"y(3) Catch 0x00007f648df8ed78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d0 Return addr 0x00007f6444a49ea8 (ns_single_vbucket_mover:'-wait_index_updated/5-fun-0-'/5 + 104)">>,
                   <<>>,
                   <<"0x00007f6493d4f0d8 Return addr 0x00007f64917f38a0 (proc_lib:init_p/3 + 688)">>,
                   <<>>,
                   <<"0x00007f6493d4f0e0 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007f64917f38c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,2}]},
       {heap_size,610},
       {total_heap_size,1597},
       {links,[<17674.13242.5>]},
       {memory,13688},
       {message_queue_len,0},
       {reductions,806},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 22/Jul/14 ]
And this:
     {<0.13891.5>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f64448ad040 (capi_set_view_manager:'-do_wait_index_updated/4-lc$^0/1-0-'/3 + 64)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f643e3ac948 Return addr 0x00007f64448abb90 (capi_set_view_manager:do_wait_index_updated/4 + 848)">>,
                   <<"y(0) #Ref<0.0.9.246814>">>,
                   <<"y(1) #Ref<0.0.9.246821>">>,
                   <<"y(2) #Ref<0.0.9.246820>">>,<<"y(3) []">>,<<>>,
                   <<"0x00007f643e3ac970 Return addr 0x00007f64917f3ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) {<0.13890.5>,#Ref<0.0.9.246813>}">>,<<>>,
                   <<"0x00007f643e3ac980 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f64917f3ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,5}]},
       {heap_size,987},
       {total_heap_size,1974},
       {links,[]},
       {memory,16808},
       {message_queue_len,0},
       {reductions,1425},
       {trap_exit,false}]}
Comment by Parag Agarwal [ 22/Jul/14 ]
Still seeing the issue in 3.0.0-1000, centos 6x, ubuntu 1204
Comment by Sriram Melkote [ 22/Jul/14 ]
Sarath, can you please take a look?
Comment by Nimish Gupta [ 22/Jul/14 ]
The error in the HTTP query will not hang the rebalance; the HTTP query error is happening because the ddoc was updated.
I see there is an error getting mutations for partition 127 from ep-engine:

[couchdb:info,2014-07-22T13:37:59.764,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.866,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:37:59.967,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.070,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.171,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.272,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.373,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.474,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.575,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.676,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.777,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.878,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...
[couchdb:info,2014-07-22T13:38:00.979,ns_1@10.6.2.149:<0.15712.2>:couch_log:info:41]upr client (<0.15522.2>): Temporary failure on stream request on partition 127. Retrying...

There are a lot of these messages, continuing right up until the logs were collected.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Yes, ep-engine kept on returning ETMPFAIL for partition 127's stream request. Hence, indexing never progressed.
EP-Engine team should take a look.
Comment by Sarath Lakshman [ 22/Jul/14 ]
Tue Jul 22 13:52:14.041453 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state
Tue Jul 22 13:52:14.143551 PDT 3: (default) UPR (Producer) eq_uprq:mapreduce_view: default _design/ddoc1 (prod/main) - (vb 127) Stream request failed because this vbucket is in backfill state

It seems that vbucket 127 is in backfill state and the backfill never completes.
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/39896




[MB-11809] {UPR}:: Rebalance-in of 2 nodes is stuck when doing Ops Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Parag Agarwal Assignee: Mike Wiederhold
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
1014, centos 6x

Vms:: 10.6.2.144-150

1. Create 7 node cluster
2. Create default bucket
3. Add 400 K items
4. Do mutations and rebalance-out 2 nodes
5. Do mutations and rebalance-in 2 nodes

Step 5 leads to the rebalance being stuck.

Test Case:: ./testrunner -i ../palm.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,get-coredumps=True,force_kill_memached=False,verify_unacked_bytes=True,total_vbuckets=128,std_vbuckets=5 -t rebalance.rebalanceinout.RebalanceInOutTests.incremental_rebalance_out_in_with_mutation,init_num_nodes=3,items=400000,skip_cleanup=True,GROUP=IN_OUT;P0


 Comments   
Comment by Parag Agarwal [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11809/log.tar.gz
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
The takeover request appears to be stuck. That's on node .147.

     {<19779.11046.0>,
      [{registered_name,'replication_manager-default'},
       {status,waiting},
       {initial_call,{proc_lib,init_p,5}},
       {backtrace,[<<"Program counter: 0x00007f1b1d12ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007f1ad3083860 Return addr 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.0.169038>">>,<<"y(1) infinity">>,
                   <<"y(2) {takeover,78}">>,<<"y(3) '$gen_call'">>,
                   <<"y(4) <0.11353.0>">>,<<"y(5) []">>,<<>>,
                   <<"0x00007f1ad3083898 Return addr 0x00007f1acbd79e70 (replication_manager:handle_call/3 + 2840)">>,
                   <<"y(0) infinity">>,<<"y(1) {takeover,78}">>,
                   <<"y(2) 'upr_replicator-default-ns_1@10.6.2.146'">>,
                   <<"y(3) Catch 0x00007f1b198ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007f1ad30838c0 Return addr 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<"y(0) [{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FM\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.149',\"789\"}]">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<>>,
                   <<"0x00007f1ad30838d8 Return addr 0x00007f1b1d133ab0 (proc_lib:init_p_do_apply/3 + 56)">>,
                   <<"y(0) replication_manager">>,
                   <<"(1) {state,\"default\",dcp,[{'ns_1@10.6.2.145',\" \"},{'ns_1@10.6.2.146',\"FMN\"},{'ns_1@10.6.2.148',\"GH\"},{'ns_1@10.6.2.1">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) <0.11029.0>">>,
                   <<"y(4) {dcp_takeover,'ns_1@10.6.2.146',78}">>,
                   <<"y(5) {<0.11528.0>,#Ref<0.0.0.169027>}">>,
                   <<"y(6) Catch 0x00007f1b198d3570 (gen_server:handle_msg/5 + 272)">>,
                   <<>>,
                   <<"0x00007f1ad3083918 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) Catch 0x00007f1b1d133ad0 (proc_lib:init_p_do_apply/3 + 88)">>,
                   <<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,42}]},
       {heap_size,610},
       {total_heap_size,2208},
       {links,[<19779.11029.0>]},
       {memory,18856},
       {message_queue_len,2},
       {reductions,17287},
       {trap_exit,true}]}
Comment by Mike Wiederhold [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39894




[MB-11725] XDCR data loss after rebalance out Created: 14/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Aruna Piravi Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build - 3.0.0-957 centOS 6.x

Issue Links:
Duplicate
duplicates MB-11672 Missing items in index after rebalanc... Resolved
duplicates MB-11757 KV+XDCR System test: Possibly missing... Resolved
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Setup
--------
3*2 clusters, bi-xdcr on default bucket.
C1 = [.186, .187, .190]
C2 = [.188,.189]

Scenario
------------
1. Load 20K items onto cluster1, another 20K items onto cluster2.
2. In parallel, rebalance out 2 nodes from source.
3. Compare item count, data and meta-data

Testcase
-------------
./testrunner -i bixdcr.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True -t xdcr.rebalanceXDCR.Rebalance.async_rebalance_out,items=20000,rdirection=bidirection,async_load=True,ctopology=chain,expires=60,rebalance=source,num_rebalance=2

Missing keys
------------------
Source : 40000
Destination: 39996

4 missing keys on C2 (loaded at C1) :

< {"id":"loadOne2203","key":"loadOne2203","value":"1-002a5bb5180aa2e80000000000000000"},
< {"id":"loadOne2393","key":"loadOne2393","value":"1-002a5bb5607520d90000000000000000"},
< {"id":"loadOne2571","key":"loadOne2571","value":"1-002a5bb58ef3ddb40000000000000000"},
< {"id":"loadOne4133","key":"loadOne4133","value":"1-002a5bb86fbcdc440000000000000000"},

Attached
------------
1. cbcollect with xdcr trace logging
2. data files

Consistent?
----------------
No, reproduced 2 out of 6 times.


 Comments   
Comment by Aruna Piravi [ 14/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11725/logs.zip
Comment by Thuan Nguyen [ 14/Jul/14 ]
What is the build number?
Comment by Aruna Piravi [ 14/Jul/14 ]
sorry, 3.0.0-957
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
xdcr_trace logs are missing
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
also it would be great if you could avoid naming .tar.gz files as if they're .zip
Comment by Aruna Piravi [ 14/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11725/diag.tar
https://s3.amazonaws.com/bugdb/jira/MB-11725/couch.tar -- thanks for the feedback, see if this naming helps.

Will paste list of missing keys.
Comment by Aruna Piravi [ 14/Jul/14 ]
Pls note, in this run, 6 keys are missing at C2.

< {"total_rows":40000,"rows":[
---
> {"total_rows":39994,"rows":[
11801d11800
< {"id":"loadOne2617","key":"loadOne2617","value":"1-000fed541566863b0000000000000000"},
11989d11987
< {"id":"loadOne2787","key":"loadOne2787","value":"1-000fed543c3c5d350000000000000000"},
13183d13180
< {"id":"loadOne3861","key":"loadOne3861","value":"1-000fed55f052eff00000000000000000"},
13621d13617
< {"id":"loadOne4255","key":"loadOne4255","value":"1-000fed5673c4d2d10000000000000000"},
13923d13918
< {"id":"loadOne4527","key":"loadOne4527","value":"1-000fed56b72108db0000000000000000"},
16637d16631
< {"id":"loadOne6970","key":"loadOne6970","value":"1-000fed59d4205c890000000000000000"},
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
Not sure whether it helps, but you've uploaded a .tar.gz as .tar this time.
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
I've uploaded an improved logger to gerrit: http://review.couchbase.org/39374

I'll need you to rerun with a build containing this fix. But before that I need you to fix the clocks on all the nodes. They must be synchronized across both clusters so that I can correlate events in different logs.
Comment by Aleksey Kondratenko [ 14/Jul/14 ]
See above
Comment by Aruna Piravi [ 15/Jul/14 ]
Build -3.0.0-967

1. Synchronized clocks

Arunas-MacBook-Pro:testrunner apiravi$ ./scripts/ssh.py -i bixdcr.ini "date"
10.3.4.191
Tue Jul 15 16:38:44 PDT 2014

10.3.4.190
Tue Jul 15 16:38:45 PDT 2014

10.3.4.186
Tue Jul 15 16:38:45 PDT 2014

10.3.4.189
Tue Jul 15 16:38:45 PDT 2014

10.3.4.188
Tue Jul 15 16:38:45 PDT 2014

10.3.4.187
Tue Jul 15 16:38:45 PDT 2014


2. 16 items missing on destination this time

2014-07-15 16:38:17 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: curr_items 39984 == 40000 expected on '10.3.4.188:8091''10.3.4.189:8091', default bucket
2014-07-15 16:38:17 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_active_curr_items 39984 == 40000 expected on '10.3.4.188:8091''10.3.4.189:8091', default bucket
2014-07-15 16:38:17 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 39984 == 40000 expected on '10.3.4.188:8091''10.3.4.189:8091', default bucket

Missing keys at C2 (initially loaded at C1)
---------------------------------------------------------

< {"total_rows":40000,"rows":[
---
> {"total_rows":39984,"rows":[
471d470
< {"id":"loadOne10419","key":"loadOne10419","value":"1-00103ccea7cd046d0000000000000000"},
659d657
< {"id":"loadOne10589","key":"loadOne10589","value":"1-00103ccebdb4947f0000000000000000"},
14095d14092
< {"id":"loadOne4682","key":"loadOne4682","value":"1-00103cc972439fd00000000000000000"},
14129d14125
< {"id":"loadOne4712","key":"loadOne4712","value":"1-00103cc97619298b0000000000000000"},
14378d14373
< {"id":"loadOne4937","key":"loadOne4937","value":"1-00103cc9a37506790000000000000000"},
14486d14480
< {"id":"loadOne5033","key":"loadOne5033","value":"1-00103cc9b4e206110000000000000000"},
14719d14712
< {"id":"loadOne5243","key":"loadOne5243","value":"1-00103cc9d9d3346a0000000000000000"},
15039d15031
< {"id":"loadOne5531","key":"loadOne5531","value":"1-00103cca1ac6de8d0000000000000000"},
15272d15263
< {"id":"loadOne5741","key":"loadOne5741","value":"1-00103cca3c0853580000000000000000"},
15519d15509
< {"id":"loadOne5964","key":"loadOne5964","value":"1-00103cca76f75e840000000000000000"},
15629d15618
< {"id":"loadOne6062","key":"loadOne6062","value":"1-00103cca8bd4ada40000000000000000"},
15796d15784
< {"id":"loadOne6212","key":"loadOne6212","value":"1-00103ccaa233d03c0000000000000000"},
15984d15971
< {"id":"loadOne6382","key":"loadOne6382","value":"1-00103ccac46f73f40000000000000000"},
16182d16168
< {"id":"loadOne6560","key":"loadOne6560","value":"1-00103ccae11cc9f10000000000000000"},
18167d18152
< {"id":"loadOne8347","key":"loadOne8347","value":"1-00103cccb086e4b60000000000000000"},
18265d18249
< {"id":"loadOne8435","key":"loadOne8435","value":"1-00103cccc41fa2e30000000000000000"},

3. Logs
    -------
cbcollect + data files - https://s3.amazonaws.com/bugdb/jira/MB-11725/11725.tar
Comment by Aleksey Kondratenko [ 16/Jul/14 ]
For loadOne6212 (vbucket 854) I see no traces of it ever being received by xdcr via upr.

Also, notably, loadOne10586 is actually present on all nodes. So this might be another UPR bug affecting views (i.e. my understanding is that the mismatch was detected using views).
Comment by Mike Wiederhold [ 16/Jul/14 ]
Appears to be an issue with a backfill not reading items from disk properly. I'm creating a toy build to get more information.
Comment by Mike Wiederhold [ 16/Jul/14 ]
Can you please try to reproduce the issue with the toy build linked below? It contains a possible fix.

http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-3.0.0-toy-mikewied-x86_64_3.0.0-716-toy.rpm
Comment by Sriram Melkote [ 17/Jul/14 ]
Also see MB-11672 which Mike indicates is a duplicate of this bug
Comment by Sriram Melkote [ 17/Jul/14 ]
Mike, on MB-11672, Abhinav had noted:

"So when the request came in for stream from 0 to 243, and ep-engine sent back a disk snapshot from 0 to 0, it' when there are no items on disk.
For backfill to run, we are supposed to wait for all those items to persist to disk, which doesn't seem to be in this case. I'll need to figure out why."

Is this the case in this bug as well?
Comment by Mike Wiederhold [ 17/Jul/14 ]
Yes.
Comment by Aruna Piravi [ 17/Jul/14 ]
The fix that went into the toy build did not help. Pls see below -


2014-07-17 13:57:42 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: curr_items 39997 == 40000 expected on '10.3.4.188:8091''10.3.4.189:8091', default bucket
2014-07-17 13:57:42 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_active_curr_items 39997 == 40000 expected on '10.3.4.188:8091''10.3.4.189:8091', default bucket
2014-07-17 13:57:42 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 39997 == 40000 expected on '10.3.4.188:8091''10.3.4.189:8091', default bucket

Live cluster available for 2 hrs - http://10.3.4.186:8091/index.html . Thanks.
Comment by Sangharsh Agarwal [ 18/Jul/14 ]
The same issue was observed with a rebalance-in operation as well, on build 973:
https://friendpaste.com/24RBXFi7oLt0yDHSrLYrO7

Live Cluster is available for debug

[Source]
10.3.3.144
10.3.3.146
10.3.3.147

[Destination]
10.3.3.142
10.3.3.143
10.3.3.145
10.3.3.148



[Errors]
2014-07-18 04:06:30 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: curr_items 139994 == 140000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
2014-07-18 04:06:30 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_active_curr_items 139994 == 140000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
2014-07-18 04:06:31 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 139994 == 140000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
2014-07-18 04:06:35 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: curr_items 139994 == 140000 expected on '10.3.3.146:8091''10.3.3.144:8091''10.3.3.147:8091', default bucket
2014-07-18 04:06:35 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_active_curr_items 139994 == 140000 expected on

[Metadata verification found missing keys]

2014-07-18 04:10:42 | ERROR | MainProcess | test_thread | [xdcrbasetests._verify_revIds] 6 keys not found on 10.3.3.146:[('key: LoadTwo68769', 'vbucket: 1016'), ('key: LoadTwo58159', 'vbucket: 1016'), ('key: LoadTwo57898', 'vbucket: 1016'), ('key: LoadTwo57908', 'vbucket: 1016'), ('key: LoadTwo52989', 'vbucket: 1016'), ('key: LoadTwo52819', 'vbucket: 1016')]
Comment by Aruna Piravi [ 18/Jul/14 ]
If Mike thinks 11757 is a duplicate of this problem, and this is an upr-backfill problem, then it may not be restricted to capi mode alone. Also 11757 was in xmem. I think Alk added "capi mode" to add extra logging. Hence removing it now. Pls correct if I'm wrong.
Comment by Aruna Piravi [ 18/Jul/14 ]
Yesterday's updates :

 Mike provided another toy-build. Ran twice with info logging and twice again without info logging. Could not reproduce the bug. Apparently the only difference between the two toy-builds was an extra line of logging. And that's executed only with info-level logging. So there's still no clarity on why the problem is so inconsistent. Mike is working on a fix.
Comment by Mike Wiederhold [ 21/Jul/14 ]
I cannot reproduce the issue with the test case that originally caused this problem. There are some duplicate bugs filed that also show the same symptoms of this issue and I'll work with QE to try to reproduce the issue with those tests.
Comment by Mike Wiederhold [ 22/Jul/14 ]
Assigning back to QE for now since I'm waiting for this issue to be reproduced again.




[MB-11824] [system test] [kv unix] rebalance hang at 0% when add a node to cluster Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.4 64-bit

Attachments: Zip Archive 172.23.107.195-7252014-1342-diag.zip     Zip Archive 172.23.107.196-7252014-1345-diag.zip     Zip Archive 172.23.107.197-7252014-1349-diag.zip     Zip Archive 172.23.107.199-7252014-1352-diag.zip     Zip Archive 172.23.107.200-7252014-1356-diag.zip     Zip Archive 172.23.107.201-7252014-143-diag.zip     Zip Archive 172.23.107.202-7252014-1359-diag.zip     Zip Archive 172.23.107.203-7252014-146-diag.zip    
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.rpm.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on 8 nodes
1:172.23.107.195
2:172.23.107.196
3:172.23.107.197
4:172.23.107.199
5:172.23.107.200
6:172.23.107.202
7:172.23.107.201

8:172.23.107.203

Create a cluster of 7 nodes
Create 2 buckets: default and sasl-2 (no view)
Load 25+ million items into each bucket to bring the active resident ratio down to 80%
Do update, expire and delete operations on both buckets for 3 hours.
Then add node 203 to the cluster. Rebalance hangs at 0%.

Live cluster is available to debug


 Comments   
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809. We currently have two bug fixes in flight that fix rebalance-stuck issues (MB-11809 and MB-11786). Please run the tests with these changes merged before filing any other rebalance-stuck issues.




[MB-8207] moxi does not allow a noop before authentication on binary protocol Created: 07/May/13  Updated: 25/Jul/14  Due: 20/Jun/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Matt Ingenthron Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
We had added a feature to spymemcached based on a Couchbase user's request to detect hung processes. This tries to complete a noop before doing auth.

It's fine against memcached/ep-engine in all cases, and it appears to be fine for ascii (where there is no authentication and it falls back to the version command), but moxi does not seem to allow auth after the noop. This may be because it's expecting the first command to wire it up to a downstream for the "gateway" moxi?
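For reference, a rough, hypothetical sketch (in Python, not the spymemcached code itself) of the kind of probe described above: send a binary-protocol NOOP (opcode 0x0a) before authenticating and check whether the endpoint answers. The host and port are placeholders.

import socket
import struct

def noop_probe(host, port, timeout=2.0):
    # 24-byte binary request header: magic 0x80, opcode 0x0a (NOOP), all other fields zero.
    request = struct.pack(">BBHBBHIIQ", 0x80, 0x0A, 0, 0, 0, 0, 0, 0, 0)
    sock = socket.create_connection((host, port), timeout)
    try:
        sock.sendall(request)
        response = sock.recv(24)
        # A binary response header starts with magic 0x81; a healthy endpoint should answer the NOOP.
        return len(response) == 24 and response[0:1] == b"\x81"
    finally:
        sock.close()

print(noop_probe("127.0.0.1", 11211))  # e.g. against moxi's proxy port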

We're going to search for a workaround, but I wanted to make sure this issue was known.

See also:
https://code.google.com/p/spymemcached/issues/detail?id=272&thanks=272&ts=1364702110
and
https://github.com/mumoshu/play2-memcached/issues/17

 Comments   
Comment by Maria McDuff (Inactive) [ 08/May/13 ]
per bug triage, assigning to ronnie.
ronnie --- pls take a look. thanks.
Comment by Maria McDuff (Inactive) [ 08/Oct/13 ]
ronnie,

can you please update this bug based on 2.2.0 build 821?
thanks.
Comment by Maria McDuff (Inactive) [ 19/May/14 ]
Iryna,

can you verify in 3.0? if not resolved pls assign to Steve Y. Thanks.
Comment by Steve Yen [ 25/Jul/14 ]
Scrubbing through ancient moxi issues.

From the cproxy_protocol_b.c code, I see that moxi should be able to handle only VERSION and QUIT commands before doing an AUTH.

Rather than changing the "stabilized" moxi codebase, I'm marking this Won't Fix in the hope that our lithe, warm-blooded and fast-moving SDKs might be able to maneuver faster than moxi.




[MB-11665] {UPR}: During a 2 node rebalance-in scenario :: Java SDK (1.4.2) usage sees a drop in OPS (get/set) by 50% and high error rate Created: 07/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: clients
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Parag Agarwal Assignee: Wei-Li Liu
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 172.23.107.174-177

Triage: Untriaged
Operating System: Centos 64-bit
Flagged:
Release Note
Is this a Regression?: Yes

 Description   
We have compared the run of 2.5.1-0194 vs 3.0.0-918, Java SDK used 1.4.2

Common Scenario

1. Create a 2 node cluster
2. Create 1 default bucket
3. Add 15K items while doing gets and sets
4. Add 2 nodes and then rebalance
5. Run Get and Set again in parallel to rebalance

Issue observed during Step 5: ops drop by 50% and the error rate is high most of the time, compared to 2.5.1

The comparative report is shared here

General Comparison Summary

https://docs.google.com/document/d/1PjQBdJvLFaK85OrrYzxOaZ54fklTXibj6yKVrrU-AOs/edit

3.0.0-918:: http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-3.0.0-918/Rb2In-HYBRID/07-03-14/068545/22bcef05a4f12ef3f9e7f69edcfc6aa4-MC.html

2.5.1-1094: http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-2.5.1-1094/Rb2In-HYBRID/06-24-14/083822/2f416c3207cf6c435ae631ae37da4861-MC.html
Attaching logs

We are trying to run more tests with different versions of the SDK, such as 1.4.3 and 2.0

https://s3.amazonaws.com/bugdb/jira/MB-11665/logs_3_0_0_918_SDK_142.tar.gz


 Comments   
Comment by Parag Agarwal [ 07/Jul/14 ]
Pavel: Please add your comments for such a scenario with libcouchbase
Comment by Pavel Paulau [ 07/Jul/14 ]
Not exactly the same scenario but I'm not seeing major drops/errors in my tests (using lcb based workload generator).
Comment by Parag Agarwal [ 08/Jul/14 ]
So Deepti posted results and we are not seeing issues with 1.4.3 for the same run. What is the difference between SDK 1.4.2 and 1.4.3?
Comment by Aleksey Kondratenko [ 08/Jul/14 ]
Given that the problem seems to be SDK-version specific and there's no evidence yet that it's something ns_server may cause, I'm bouncing this ticket back.
Comment by Matt Ingenthron [ 08/Jul/14 ]
Check the release notes for 1.4.3. We had an issue where there would be authentication problems, including timeouts and problems with retries. This was introduced in changes in 1.4.0 and fixed in 1.4.3. There's no direct evidence, but that sounds like a likely cause.
Comment by Matt Ingenthron [ 08/Jul/14 ]
Parag: not sure why you assigned this to me. I don't think there's any action for me. Reassigning back. I was just giving you additional information.
Comment by Wei-Li Liu [ 08/Jul/14 ]
Re-ran the test with the 1.4.2 SDK against the 3.0.0 server with just 4GB RAM per node (compared to my initial test with 16GB RAM per node).
The test result is much better: not seeing the errors, and the operations rate never drops significantly.
http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-3.0.0-918/Rb2In-HYBRID/07-08-14/074980/d5e2508529f1ad565ee38c9b8ab0c75b-MC.html
 
Comment by Parag Agarwal [ 08/Jul/14 ]
Sorry, Matt! Should we close this and document it in the release notes?
Comment by Matt Ingenthron [ 08/Jul/14 ]
Given that we believe it's an issue in a different project (JCBC), fixed and release noted there, I think we can just close this. The only other possible action, up to you and your management, is trying to verify this is the actual cause a bit more thoroughly.
Comment by Mike Wiederhold [ 25/Jul/14 ]
I haven't seen any activity on this in weeks, and the last test results look good, so I'm going to mark it as fixed. Please re-open if something still needs to be done for this ticket.




[MB-11822] numWorkers setting of 5 is treated as high priority but should be treated as low priority. Created: 25/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Venu Uppalapati Assignee: Sundar Sridharan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: No

 Description   
https://github.com/couchbase/ep-engine/blob/master/src/workload.h#L44-48
We currently use the priority conversion formula seen in the code linked above.
It assigns a numWorkers setting of 5 high priority, but the expectation is that <=5 is low priority.
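For illustration only (the real conversion is the C++ formula in workload.h linked above; the names below are assumptions made for this sketch), the expected mapping is:

# Sketch of the *expected* behaviour described in this ticket; not the actual ep-engine code.
LOW_PRIORITY_MAX_WORKERS = 5

def expected_bucket_priority(num_workers):
    # Expectation: a numWorkers setting of 5 or less maps to low priority.
    return "low" if num_workers <= LOW_PRIORITY_MAX_WORKERS else "high"

print(expected_bucket_priority(5))  # "low" -- the server currently treats 5 as high priority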

 Comments   
Comment by Sundar Sridharan [ 25/Jul/14 ]
Fix uploaded for review at http://review.couchbase.org/39891. Thanks.




[MB-9013] Moxi server restart exiting with code 139 Created: 30/Aug/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.0.1
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Luca Mazzocchi Assignee: Steve Yen
Resolution: Incomplete Votes: 0
Labels: restart
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: VMWare 5.1 Update 2
Centos
2 CPU
8 GB

Cluster with 2 node. We are using Memcached bucket


 Description   
Periodically (every 2 hours) we get this message:

Port server moxi on node 'ns_1@couch-ep-1.eprice.lan' exited with status 139. Restarting. Messages: 2013-08-30 15:34:03: (cproxy_config.c.317) env: MOXI_SASL_PLAIN_USR (13)
2013-08-30 15:34:03: (cproxy_config.c.326) env: MOXI_SASL_PLAIN_PWD (12)

alternating between couch-ep-1 and couch-ep-2

The memcached hit ratio drops and the client (an ecommerce site) logs messages like "connection refused".


 Comments   
Comment by Maria McDuff (Inactive) [ 30/Aug/13 ]
anil,
pls decide which release this should go into.
Comment by Steve Yen [ 25/Jul/14 ]
(scrubbing through ancient moxi bugs on the path to 3.0)

Not sure what the exact cause of the 139 (sigsegv) was back then, but there was at least one crash fix after moxi 2.0.1 -- see MB-8102.

In the hope that that was the cause, marking this report/issue as incomplete.




[MB-9874] [Windows] Couchstore drop and reopen of file handle fails Created: 09/Jan/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 3.0
Fix Version/s: 3.0.1, 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: Windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


 Description   
The unit test doing couchstore_drop_file and couchstore_reopen_file fails due to COUCHSTORE_READ_ERROR when it tries to reopen the file.

The commit http://review.couchbase.org/#/c/31767/ disabled the test to allow the rest of the unit tests to be executed.

 Comments   
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-10292] [windows] assertion failure in test_file_sort Created: 24/Feb/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Trond Norbye Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
assertion on line 263 fails: assert(ret == FILE_SORTER_SUCCESS);

ret == FILE_SORTER_ERROR_DELETE_FILE

 Comments   
Comment by Trond Norbye [ 27/Feb/14 ]
I've disabled the test for win32 with http://review.couchbase.org/#/c/33985/ to allow us to find other regressions.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Chiyoung, Anil, Venu, Wayne .. July 17th




[MB-8527] Moxi honors http_proxy environment variable Created: 27/Jun/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Bill Nash Assignee: Steve Yen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit

 Description   

When set, Moxi honors http_proxy when attempting to connect to 127.0.0.1:8091.

In the example below, 10.12.78.99 is the ip address of my http proxy, which I enabled to download and upgrade to CB 2.1.0. As it was still set at cluster start time, Moxi began attempting to honor it, consequently blocking all read and write attempts, even though the cluster was otherwise indicated to be healthy.

[pid 27410] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 45
[pid 27410] setsockopt(45, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 27410] fcntl(45, F_GETFL) = 0x2 (flags O_RDWR)
[pid 27410] fcntl(45, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 27410] connect(45, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("10.12.78.99")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 27410] poll([{fd=45, events=POLLOUT}], 1, 1000) = 1 ([{fd=45, revents=POLLOUT}])
[pid 27410] getsockopt(45, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 27410] getpeername(45, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("10.12.78.99")}, [16]) = 0
[pid 27410] getsockname(45, {sa_family=AF_INET, sin_port=htons(53608), sin_addr=inet_addr("10.12.54.42")}, [16]) = 0
[pid 27410] sendto(45, "GET http://127.0.0.1:8091/pools/"..., 171, MSG_NOSIGNAL, NULL, 0) = 171
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=45, revents=POLLIN}])
[pid 27410] poll([{fd=45, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=45, revents=POLLIN}])
[pid 27410] recvfrom(45, "HTTP/1.0 504 Gateway Time-out\r\nS"..., 16384, 0, NULL, NULL) = 1616

Issuing 'unset http_proxy' and restarting the cluster / killing moxi corrects the issue.

The error is mentioned in the babysitter logs:
babysitter.1:[ns_server:info,2013-06-27T11:38:19.083,babysitter_of_ns_1@127.0.0.1:<0.91.0>:ns_port_server:log:168]{moxi,"Atlas"}<0.91.0>: 2013-06-27 11:38:20: (agent_config.c.423) ERROR: parse JSON failed, from REST server: http://127.0.0.1:8091/pools/default/bucketsStreaming/Atlas, <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html><head> <meta http-equiv="Content-Type" CONTENT="text/html; charset=utf-8"> <title>ERROR: The requested URL could not be retrieved</title> <style type="text/css"><!-- %l body :lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; } :lang(he) { direction: rtl; float: right; } --></style> </head><body> <div id="titles"> <h1>ERROR</h1> <h2>The requested URL could not be retrieved</h2> </div> <hr> <div id="content"> <p>The following error was encountered while trying to retrieve the URL: <a href="http://127.0.0.1:8091/pools/default/bucketsStreaming/Atlas">http://127.0.0.1:8091/pools/default/bucketsStreaming/Atlas&lt;/a&gt;&lt;/p> <blockquote id="error"> <p><b>Connection to 127.0.0.1 failed.</b></p> </blockquote>

I would suggest modifying moxi to never honor proxy settings, or to explicitly unset them at server start time.
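A minimal sketch of the second suggestion (explicitly clearing the proxy variables at start time), written as a hypothetical Python wrapper rather than the actual init script:

# Hypothetical wrapper: scrub proxy variables from the environment before launching moxi,
# so a leftover http_proxy cannot redirect the 127.0.0.1:8091 REST traffic.
import os
import subprocess

env = dict(os.environ)
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    env.pop(var, None)

# Add the usual moxi arguments here; the binary path is illustrative.
subprocess.call(["/opt/couchbase/bin/moxi"], env=env)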

 Comments   
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Steve,

is this an easy fix?
Comment by Steve Yen [ 25/Jul/14 ]
Going through ancient moxi issues.

I'm worried about changing this behavior as some users might actually be depending on moxi's current http proxy behavior, especially those perhaps doing standalone moxi as opposed to the moxi packaged in couchbase.




[MB-8601] Log is not self-descriptive when Moxi crashes due to not having vbucket map Created: 12/Jul/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: moxi
Affects Version/s: 2.1.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Larry Liu Assignee: Steve Yen
Resolution: Fixed Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged

 Description   
http://www.couchbase.com/issues/browse/MB-8431

That bug was closed as won't fix, but the log message is not self-descriptive and can be misinterpreted as moxi crashing. It makes sense to do the following:

1. Moxi should not crash while waiting for the vbucket map.
2. If moxi crashes due to a missing vbucket map, the log should be clear so that the user does not panic.





 Comments   
Comment by Maria McDuff (Inactive) [ 20/May/14 ]
Steve,

this will be very helpful for 3.0.
Comment by Steve Yen [ 25/Jul/14 ]
http://review.couchbase.org/39895
Comment by Steve Yen [ 25/Jul/14 ]
http://review.couchbase.org/39897




[MB-11816] couchbase-cli failed to collect logs in cluster-wide collection Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 64-bit

Triage: Triaged
Operating System: Ubuntu 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_3.0.0-1022-rel.deb.manifest.xml
Is this a Regression?: Yes

 Description   
Install couchbase server 3.0.0-1022 on one ubuntu 12.04 node
Run cluster-wide collectinfo using couchbase-cli
Failed to collect

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c localhost:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 127.0.0.1:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --all-nodes
ERROR: option --all-nodes not recognized

 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39848/
Comment by Thuan Nguyen [ 25/Jul/14 ]
Verified on build 3.0.0-1028. This bug was fixed.




[MB-10814] tap based make simple test is failing (on kingstar) Created: 09/Apr/14  Updated: 25/Jul/14  Resolved: 07/May/14

Status: Closed
Project: Couchbase Server
Component/s: build, couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aleksey Kondratenko Assignee: Ketaki Gangal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Something recently caused make simple-test to start failing. Not sure what (could be my team), but it's worth raising the alarm.

 Comments   
Comment by Aleksey Kondratenko [ 09/Apr/14 ]
Here's what I get:

# make simple-test
scripts/start_cluster_and_run_tests.sh b/resources/dev-4-nodes-xdcr.ini conf/simple.conf 0
~/src/altoros/moxi/repo3/testrunner ~/src/altoros/moxi/repo3/testrunner
~/src/altoros/moxi/repo3/testrunner
rebalance_in_with_ops (rebalance.rebalancein.RebalanceInTests) ... ERROR


And observing the cluster, I'm not able to see any obvious errors. Given that this thing is not telling me exactly what is broken, I have no idea whether something is broken on the ns_server side or elsewhere (or even whether the tests themselves are broken).
Comment by Aleksey Kondratenko [ 09/Apr/14 ]
CC-ed lots of folks given that this is an apparent emergency under our agreed-on protocols.
Comment by Maria McDuff (Inactive) [ 09/Apr/14 ]
make-simple test bug re-opened yesterday - (still open) https://www.couchbase.com/issues/browse/MB-10780
Comment by Aleksey Kondratenko [ 09/Apr/14 ]
This one fails at RebalanceInTests phase. My guess is that it's prior to any views at all.
Comment by Sriram Melkote [ 10/Apr/14 ]
Wayne, will you back out changes that broke this? Mike will independently try backing out some changes to see if they MAY cause the issue. But we don't have a specific cause identified so far.
Comment by Aleksey Kondratenko [ 10/Apr/14 ]
Mike helped me look deeper. We found that tap replication is not working. There's a slight items mismatch that's causing the first test to fail.

And Mike said he'll take a look
Comment by Mike Wiederhold [ 11/Apr/14 ]
I'm going to leave this open until every problem I see is resolved, but we should now only be seeing a sporadic failure on the employee dataset test, as well as an issue I need to look at that only seems to be happening on Alk's machine.
Comment by Aleksey Kondratenko [ 11/Apr/14 ]
>> only seems to be happening on Alk's machine.

Correction even with latest code it happens on the following machines:

* beta (my desktop box). Both i386 and amd64

* chi (my laptop)

* kingstar

All are running more or less recent GNU/Linux Debian Sid amd64 (with exception of beta running i386). Particularly all boxes have fairly recent version of gcc.

Comment by Aliaksey Artamonau [ 11/Apr/14 ]
Just ran make simple-test on my laptop and saw same kind of failures.
Comment by Mike Wiederhold [ 15/Apr/14 ]
I've updated the title of this bug. At the moment this only seems to affect kingstar and Alk's machines. I will investigate it soon, but want to be clear that this doesn't seem to be affecting a large group of people.
Comment by Abhinav Dangeti [ 25/Apr/14 ]
Aliaksey, we've merged some code regarding scheduling certain kinds of tasks in ep-engine, which seems to have fixed this issue on kingstar; make simple-test is passing as well. Please verify on your machine too, and if you're okay with it, mark this as resolved. Thanks.
Comment by Aliaksey Artamonau [ 26/Apr/14 ]
Cannot verify unfortunately. The test is failing differently:

2014-04-26 12:30:22 | ERROR | MainProcess | MainThread | [rest_client._http_request] socket error while connecting to http://127.0.0.1:9000/nodes/self error [Errno 111] Connection refused
2014-04-26 12:30:23 | ERROR | MainProcess | MainThread | [rest_client._http_request] socket error while connecting to http://127.0.0.1:9000/nodes/self error [Errno 111] Connection refused
2014-04-26 12:30:24 | ERROR | MainProcess | MainThread | [rest_client._http_request] socket error while connecting to http://127.0.0.1:9000/nodes/self error [Errno 111] Connection refused
2014-04-26 12:30:25 | ERROR | MainProcess | MainThread | [rest_client._http_request] socket error while connecting to http://127.0.0.1:9000/nodes/self error [Errno 111] Connection refused
2014-04-26 12:30:26 | ERROR | MainProcess | MainThread | [rest_client._http_request] socket error while connecting to http://127.0.0.1:9000/nodes/self error [Errno 111] Connection refused
2014-04-26 12:30:27 | ERROR | MainProcess | MainThread | [rest_client._http_request] socket error while connecting to http://127.0.0.1:9000/nodes/self error [Errno 111] Connection refused
2014-04-26 12:30:28 | ERROR | MainProcess | MainThread | [rest_client._http_request] http://127.0.0.1:9000/pools/default/ error 404 reason: unknown "unknown pool"
2014-04-26 12:30:28 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 127.0.0.1
2014-04-26 12:30:28 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 127.0.0.1

Comment by Aliaksey Artamonau [ 07/May/14 ]
The test magically started passing on my machine too.
Comment by Ketaki Gangal [ 08/May/14 ]
I don't see it in any of my environments for qe-sanity or make-simple-github.

From the above comments, it looks like an env/machine-specific failure.

Please let me know why this is assigned to me and what I am expected to do for this bug.

Comment by Ketaki Gangal [ 25/Jul/14 ]
Closing this since I don't see these failures. If relevant, please open another bug, since this is a much older ticket specific to a particular env afaik.




[MB-11409] view.viewmergetests.ViewMergingTests.test_stats_error test fails due to referencing invalid object Created: 12/Jun/14  Updated: 25/Jul/14  Resolved: 16/Jun/14

Status: Closed
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sarath Lakshman Assignee: Ketaki Gangal
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
======================================================================
ERROR: test_stats_error (view.viewmergetests.ViewMergingTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/view/viewmergetests.py", line 48, in setUp
    self.fail(ex)
AssertionError: 'NoneType' object has no attribute 'vbuckets'

----------------------------------------------------------------------
Ran 1 test in 8.119s

FAILED (errors=1)
summary so far suite view.viewmergetests.ViewMergingTests , pass 0 , fail 1
failures so far...
view.viewmergetests.ViewMergingTests.test_stats_error
testrunner logs, diags and results are available under /Users/sarath/development/couchbase/testrunner/logs/testrunner-14-Jun-12_15-28-51/test_1
test fails, all of the following tests will be skipped!!!
Run after suite setup for view.viewmergetests.ViewMergingTests.test_stats_error
view.viewmergetests.ViewMergingTests.test_stats_error fail

 Comments   
Comment by Meenakshi Goel [ 12/Jun/14 ]
This error comes when we run this test independently, as it skips the setup. For that we need to set "first_case=true" if it has to run alone; otherwise, in the conf file, the setup is done only in the 1st test case.

However, the test is currently failing with the error below, which started appearing after the change to add a status code to the exception: http://review.couchbase.org/#/c/37714/3

AssertionError: 'Status 500.Error occured querying view redview_stats: {"error":"error","reason":"Reducer: Error building index for view `redview_stats`, reason: Value is not a number (key \\"1\\")"}' != 'Error occured querying view redview_stats: {"error":"error","reason":"Reducer: Error building index for view `redview_stats`, reason: Value is not a number (key \\"1\\")"}'

The test case needs to be updated to add the status code "Status 500." to the expected string.
Comment by Sarath Lakshman [ 16/Jun/14 ]
Duplicate of CBQE-1350




[MB-11064] Increase default memcached connection limit Created: 07/May/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.5.1, 3.0
Fix Version/s: feature-backlog, 3.0
Security Level: Public

Type: Improvement Priority: Major
Reporter: Perry Krug Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
Relates to
relates to MB-11066 Allow for dynamic change of the numbe... Reopened
relates to MB-11008 What is the maximum number of couchba... Resolved

 Description   
Currently the default max connection limit for memcached is set to 10k (with 1k reserved for internal traffic).

With much larger clusters as well as greatly increased numbers of client objects being deployed, I think it prudent to raise this limit. We've successfully done so when required at a number of customers already.

My recommendation would be for the limit to be 35k with 5k reserved for internal traffic.

 Comments   
Comment by Aleksey Kondratenko [ 07/May/14 ]
I believe this needs at least ack from somebody who does memcached or even from memcached _and_ ep-engine leads.
Comment by Aleksey Kondratenko [ 07/May/14 ]
Also, I personally feel that 30k is not bold enough. 100k (a 10x raise) looks like a better idea to me.
Comment by Perry Krug [ 07/May/14 ]
Adding Chiyoung and Trond.

100k sounds a bit too bold to me. The idea, in my mind at least, is to allow proper usage by larger applications to work better out of the box, while still providing a ceiling to prevent runaway processes from putting the cluster in trouble.
Comment by Trond Norbye [ 07/May/14 ]
In the current implementation we allocate 50% of the connection structures up front, and the size of each connection is 13480 bytes (including the read/write buffers etc). If we're going to raise the default that high, we must also make this logic lazier to avoid wasting too much memory on connection objects in the common case.
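For context, a rough calculation of the connection-structure overhead implied by those figures (a sketch; the 13480-byte size and the 50% up-front allocation are taken from the comment above, and the limits are just illustrative):

# Back-of-the-envelope memory overhead for various connection limits.
BYTES_PER_CONN = 13480
UPFRONT_FRACTION = 0.5  # 50% of the structures are allocated up front

for max_conns in (10000, 35000, 100000):
    upfront_mb = max_conns * UPFRONT_FRACTION * BYTES_PER_CONN / (1024.0 * 1024.0)
    total_mb = max_conns * BYTES_PER_CONN / (1024.0 * 1024.0)
    print("%6d conns: ~%3.0f MB up front, ~%4.0f MB if all are connected"
          % (max_conns, upfront_mb, total_mb))
# 10k -> ~64 MB up front; 35k -> ~225 MB up front; 100k -> ~643 MB up front (~1286 MB total)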
Comment by Aleksey Kondratenko [ 07/May/14 ]
So I assume that this is not an ack, but instead "wait a bit while memcached is ready".
Comment by Perry Krug [ 07/May/14 ]
Trond, is it too simple to ask for the code to allocate the lower of either 50% or 5k connection structures up front?

I certainly agree with your assertion that the logic needs to change but I'd also like to make sure this change doesn't get lost in circular discussions since we are out in the field raising this value anyway...

Comment by Trond Norbye [ 07/May/14 ]
Everything is ready within ep-engine and memcached. Note that even if it's _possible_ to use 100k connections, in the current state they would consume a fair amount of memory (even when idle) due to the fixed-size buffers pinned to each connection.

I'll create a new bug to be able to dynamically increase/decrease the maximum number of connections per endpoint and another bug to reduce the amount of memory being consumed per connection when it is in an idle state.
Comment by Perry Krug [ 07/May/14 ]
Awesome Trond, thanks.

Alk, is that a good enough ack? For 30k to 11210 and 5k to 11209?
Comment by Aleksey Kondratenko [ 08/May/14 ]
It has to be set in the init script first, which is in voltron. And I'm not eager to touch it (not even sure which branch, etc.).

After it's set in the init script, I'll update the default config to increase the "port 11209" limit.
Comment by Matt Ingenthron [ 12/Jun/14 ]
Wayne: what's the right way to open a request for adding test coverage against this? One area of concern is that client/application behavior may be confusing depending on how the reserved '1000' or reserved '5000' connections are handled.
Comment by Patrick Varley [ 21/Jul/14 ]
Support have KB for this issue: http://support.couchbase.com/entries/27987920-Memcached-wrapper-Increase-ulimit-n-and-ulimit-c-
Comment by Patrick Varley [ 21/Jul/14 ]
I have added the list of customers who have increased the limit, I have also added a link to MB-11066 (Allow for dynamic change of the number of connections) which Trond mentioned earlier.
Comment by Chris Hillery [ 24/Jul/14 ]
Alk - what is "initscript"? I don't immediately see anything in voltron that seems related to this.

What configuration needs to be changed? Is this an OS-level configuration that needs to be done?
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
https://github.com/couchbase/voltron/blob/master/server-overlay-rpm/etc/couchbase_init.d.tmpl#L47

and there's similar thing for .deb initscript
Comment by Chris Hillery [ 24/Jul/14 ]
Ah, you're just talking about the overall file descriptors limit.

I can make that change; what should it be changed to?
Comment by Chris Hillery [ 24/Jul/14 ]
As an aside: I'm not sure how to make the corresponding change for Mac or Windows.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Windows doesn't need it. OSX is dev-only.

Set it to 40k.
Comment by Chris Hillery [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39836/

This is for the master branch. After it is reviewed and merged, I'll cherry-pick it to 3.0.0.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Still don't see ulimit bumped as of 3.0.0 build 1026
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Merged the ns_server change (http://review.couchbase.org/39888). But given that 3.0.0 voltron apparently still doesn't have the upgraded ulimit, I'm assigning this to Chris.




[MB-11823] buildbot for ubuntu 10.04 looks like it is hung Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Thuan Nguyen Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
http://builds.hq.northscale.net:8010/builders/ubuntu-1004-x64-300-builder
I saw the build job pending for more than 13 hours.




[MB-11818] couchbase cli in cluster-wide collectinfo failed to start to collect selected nodes Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: tools
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: ubuntu 12.04 64-bit

Triage: Untriaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Unknown

 Description   
Install couchbase server 3.0.0-1022 on 4 nodes
Run couchbase cli to do cluster-wide collectinfo on one node
The collection failed to start

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.148
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-stop -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149


 Comments   
Comment by Bin Cui [ 25/Jul/14 ]
I am confused. Are you sure you want to use collect-logs-stop to start collecting?
Comment by Thuan Nguyen [ 25/Jul/14 ]
Oops, I copied the wrong command.
Here is the command that failed to start collectinfo:

root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=ns_1@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
root@ubuntu:~# /opt/couchbase/bin/couchbase-cli collect-logs-start -c 192.168.171.148:8091 -u Administrator -p password --nodes=@192.168.171.149
NODES: ERROR: command: collect-logs-start: 192.168.171.148:8091, global name 'nodes' is not defined
Comment by Bin Cui [ 25/Jul/14 ]
http://review.couchbase.org/#/c/39889/




[MB-11800] cbworkloadgen failed to run in rhel 6.5 Created: 23/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.0.1, 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Cédric Delgehier Assignee: Anil Kumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Red Hat Enterprise Linux Server release 6.5 (Santiago)
kernel 2.6.32-431.20.3.el6.x86_64

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
After installing Couchbase,

I tried cbworkloadgen, but I get an error:

{noformat}
[root@rhel65_64~]# /opt/couchbase/lib/python/cbworkloadgen --version
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/couchstore.py", line 29, in <module>
    _lib = CDLL("libcouchstore-1.dll")
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory
[root@rhel65_64~]# /opt/couchbase/lib/python/cbworkloadgen -n localhost:8091
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/couchstore.py", line 29, in <module>
    _lib = CDLL("libcouchstore-1.dll")
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory
{noformat}

Versions tested:
couchbase-server-2.0.1-170.x86_64
couchbase-server-2.5.1-1083.x86_64

 Comments   
Comment by Bin Cui [ 23/Jul/14 ]
First, please check whether libcouchstore.so is under /opt/couchbase/lib. If yes, please check whether the following Python script runs correctly:

import traceback, sys  # needed by the else branch below
import ctypes
for lib in ('libcouchstore.so', # Linux
            'libcouchstore.dylib', # Mac OS
            'couchstore.dll', # Windows
            'libcouchstore-1.dll'): # Windows (pre-CMake)
    try:
        _lib = ctypes.CDLL(lib)
        break
    except OSError, err:
        continue
else:
    traceback.print_exc()
    sys.exit(1)
Comment by Bin Cui [ 23/Jul/14 ]
The problem is possibly caused by wrong permission for ctypes module.

http://review.couchbase.org/#/c/39764/
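As a side note, a quick diagnostic (hypothetical, not part of the fix in the review above) to tell a loader search-path problem apart from a permissions problem on the library file itself:

# Try loading libcouchstore by absolute path; a permissions problem would also make
# os.access() report unreadable, while a pure search-path problem would not.
import ctypes
import os

path = "/opt/couchbase/lib/libcouchstore.so"  # path from the ls output in this ticket
print("readable: %s" % os.access(path, os.R_OK))
try:
    ctypes.CDLL(path)  # an absolute path bypasses the loader search path
    print("loaded OK by absolute path")
except OSError as err:
    print("load failed even with absolute path: %s" % err)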
Comment by Cédric Delgehier [ 23/Jul/14 ]
[root@rhel65_64 ~]# ls -al /opt/couchbase/lib/libcouchstore.so
lrwxrwxrwx 1 bin bin 22 Jul 22 14:51 /opt/couchbase/lib/libcouchstore.so -> libcouchstore.so.1.0.0

---

[root@rhel65_64 ~]# cat test.py
#!/usr/bin/env python
# -*-python-*-

import traceback, sys
import ctypes
for lib in ('libcouchstore.so', # Linux
            'libcouchstore.dylib', # Mac OS
            'couchstore.dll', # Windows
            'libcouchstore-1.dll'): # Windows (pre-CMake)
    try:
        _lib = ctypes.CDLL(lib)
        break
    except OSError, err:
        continue
else:
    traceback.print_exc()
    sys.exit(1)

[root@rhel65_64 ~]# python test.py
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    _lib = ctypes.CDLL(lib)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcouchstore-1.dll: cannot open shared object file: No such file or directory

---

[root@rhel65_64 ~]# python -c "import sys; print sys.version_info[1]"
6

---

[root@rhel65_64~]# ls -ald /opt/couchbase/lib/python/pysqlite2
drwx---r-x 3 1001 1001 4096 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2

[root@rhel65_64~]# ls -al /opt/couchbase/lib/python/pysqlite2/*
-rw----r-- 1 1001 1001 2624 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/dbapi2.py
-rw------- 1 root root 2684 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2/dbapi2.pyc
-rw----r-- 1 1001 1001 2350 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/dump.py
-rw----r-- 1 1001 1001 1020 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/__init__.py
-rw------- 1 root root 134 Jul 23 11:02 /opt/couchbase/lib/python/pysqlite2/__init__.pyc
-rwx---r-- 1 1001 1001 1253220 Jul 22 14:52 /opt/couchbase/lib/python/pysqlite2/_sqlite.so

/opt/couchbase/lib/python/pysqlite2/test:
total 120
drwx---r-- 3 1001 1001 4096 Jul 22 14:52 .
drwx---r-x 3 1001 1001 4096 Jul 23 11:02 ..
-rw----r-- 1 1001 1001 29886 Jul 22 14:52 dbapi.py
-rw----r-- 1 1001 1001 1753 Jul 22 14:52 dump.py
-rw----r-- 1 1001 1001 7942 Jul 22 14:52 factory.py
-rw----r-- 1 1001 1001 6569 Jul 22 14:52 hooks.py
-rw----r-- 1 1001 1001 1966 Jul 22 14:52 __init__.py
drwx---r-- 2 1001 1001 4096 Jul 22 14:52 py25
-rw----r-- 1 1001 1001 10443 Jul 22 14:52 regression.py
-rw----r-- 1 1001 1001 7356 Jul 22 14:52 transactions.py
-rw----r-- 1 1001 1001 15200 Jul 22 14:52 types.py
-rw----r-- 1 1001 1001 13217 Jul 22 14:52 userfunctions.py

---

[root@rhel65_64~]# ls -ald /opt/couchbase/lib/python/pysnappy2_24
ls: cannot access /opt/couchbase/lib/python/pysnappy2_24: No such file or directory
[root@rhel65_64~]# locate pysnappy
[root@rhel65_64~]#

---

As an indication, for version 4:

[root@rhel65_64~]# ls -al /usr/lib64/python2.6/lib-dynload/_ctypes.so
-rwxr-xr-x 1 root root 123608 Nov 21 2013 /usr/lib64/python2.6/lib-dynload/_ctypes.so
[root@rhel65_64~]# ls -ald /usr/lib64/python2.6/ctypes/
drwxr-xr-x. 3 root root 4096 Jul 9 19:52 /usr/lib64/python2.6/ctypes/
[root@rhel65_64~]# ls -ald /usr/lib64/python2.6/ctypes/*
-rw-r--r-- 1 root root 2041 Nov 22 2010 /usr/lib64/python2.6/ctypes/_endian.py
-rw-r--r-- 2 root root 2286 Nov 21 2013 /usr/lib64/python2.6/ctypes/_endian.pyc
-rw-r--r-- 2 root root 2286 Nov 21 2013 /usr/lib64/python2.6/ctypes/_endian.pyo
-rw-r--r-- 1 root root 17004 Nov 22 2010 /usr/lib64/python2.6/ctypes/__init__.py
-rw-r--r-- 2 root root 19936 Nov 21 2013 /usr/lib64/python2.6/ctypes/__init__.pyc
-rw-r--r-- 2 root root 19936 Nov 21 2013 /usr/lib64/python2.6/ctypes/__init__.pyo
drwxr-xr-x. 2 root root 4096 Jul 9 19:52 /usr/lib64/python2.6/ctypes/macholib
-rw-r--r-- 1 root root 8531 Nov 22 2010 /usr/lib64/python2.6/ctypes/util.py
-rw-r--r-- 1 root root 8376 Mar 20 2010 /usr/lib64/python2.6/ctypes/util.py.binutils-no-dep
-rw-r--r-- 2 root root 7493 Nov 21 2013 /usr/lib64/python2.6/ctypes/util.pyc
-rw-r--r-- 2 root root 7493 Nov 21 2013 /usr/lib64/python2.6/ctypes/util.pyo
-rw-r--r-- 1 root root 5349 Nov 22 2010 /usr/lib64/python2.6/ctypes/wintypes.py
-rw-r--r-- 2 root root 5959 Nov 21 2013 /usr/lib64/python2.6/ctypes/wintypes.pyc
-rw-r--r-- 2 root root 5959 Nov 21 2013 /usr/lib64/python2.6/ctypes/wintypes.pyo



Comment by Bin Cui [ 24/Jul/14 ]
Check whether we support RHEL 6.5 or not.
Comment by Cédric Delgehier [ 24/Jul/14 ]
http://docs.couchbase.com/couchbase-manual-2.5/cb-install/#supported-platforms
Comment by Cédric Delgehier [ 25/Jul/14 ]
So if I understand what's implied, you're telling me to roll back the security patches to version 6.3, is that it?




[MB-7432] XDCR Stats enhancements Created: 17/Dec/12  Updated: 25/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: cross-datacenter-replication, UI
Affects Version/s: 2.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Improvement Priority: Major
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
relates to MB-9218 Incoming XDCR mutations don't match o... Closed

 Description   
After seeing XDCR in action, would like to propose a few enhancements:

-Put certain statistics in the XDCR screen as well as on the graph page:
    -Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
    -Items per second to see speed of each stream and in total
    -Bandwidth in use. As per a customer, the most important thing with XDCR is going to be the possibly cross-country internet bandwidth and will need to monitor that for each replication stream and in total

-On the graph page of outgoing, I would recommend removing "mutations checked", "mutations replicated", "data replication", "active vb reps", "waiting vb reps", "secs in replicating", "secs in checkpointing", "checkpoints issued" and "checkpoints failed". These stats really aren't useful from the perspective of someone trying to monitor or troubleshoot the current state of their cluster.
-On the graph page of outbound, there's a bit of confusion over the difference between "mutations to replicate", "mutations in queue" and "queue size". Unless they are showing significantly (and usefully) different metrics, recommend to remove all but one
-On the graph page of incoming, recommend to put "total ops/sec" on the far left to line up with the "ops/sec" in the summary section
-"XDCR dest ops per sec" is confusing because this cluster is the "destination" yet the stat implies the other way around. Recommend "Incoming XDCR ops per sec"
-"XDCR docs to replicate" is a little confusing because it doesn't match the same stat in the "outbound". Recommend to change "mutations to replicate" to "XDCR docs to replicate"
-Would also be good to see outbound ops/sec in the summary section alongside the number remaining to replicate

 Comments   
Comment by Junyi Xie (Inactive) [ 18/Dec/12 ]
Perry,

I will certainly add the stats you suggested, and reorder some stats to make it more readable.


The current stats exist for a reason; actually, most of them are there because of requests from QE and the performance team, although apparently they are not that interesting to users. If they do not cause a big downside, I would like to keep them for now.
Comment by Perry Krug [ 19/Dec/12 ]
Thanks Junyi. I'd actually like to continue the discussion about removing those stats, because anything that a customer sees will generate a question as to its purpose...meaningful or not. We want the UI to be simple and direct for our users, for the purpose of understanding what the cluster/node is doing...I don't think these 11 stats help accomplish that for our customers. Additionally, I think the ns_server team would agree that the fewer stats we have overall, the better for performance and maintenance.

To be clear, I'm not advocating for these stats removed from the system completely, just from the UI.
Comment by Junyi Xie (Inactive) [ 10/Jan/13 ]
Dipti,

Perry suggested removing some XDCR stats from the UI and adding some new stats. This is a big change to the XDCR UI, and it would be better for you to be aware of it. Before going ahead and implementing this, I would like to have your comments on the following:

1) Are these new stats necessary?

2) Are the old XDCR stats which Perry suggested removing still valuable to some customers?

3) In which version do you want this change to happen, say 2.0.1 (too late?), 2.1, or 3.0, etc.?

Please add others whom you think should be aware of this.

Thanks.
Comment by Junyi Xie (Inactive) [ 10/Jan/13 ]
Please see my comments.
Comment by Junyi Xie (Inactive) [ 10/Jan/13 ]
Ketaki and Abhinav,

Please also give your feedback on the proposal Perry suggested. Thanks.
Comment by Ketaki Gangal [ 10/Jan/13 ]
Adding some more here

- Rate of Replication [items sent / sec]
- Average Replication Rate
- Lag in Replication (helpful to understand/observe if receiving too many back-offs/timeouts)
          - Average Replication lag
- Items replicated
- Items to replicate
- Percentage Conflicts in Data

Other Useful ones
-------------------------------------
-one checkpoint every minute .
-back off handled by ns-server
-how many times retry
-timeouts - failed to replicate
-average replication lag
- XDCR data size
Comment by Ketaki Gangal [ 15/Jan/13 ]
Based on our discussion today, can we have the following changes/edits to the current XDCR stats?

1. On the XDCR tab, in addition to existing information add per Replication Setup
a. Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
b.Replication rate-Items per second to see speed of each stream and in total
c.Bandwidth in use. As per a customer, the most important thing with XDCR is going to be the possibly cross-country internet bandwidth and will need to monitor that for each replication stream and in total

2. On the Main bucket section
a. Rename XDC Dest ops/sec to "Incoming XDCR ops/sec"
b. Rename XDC docs to replicate " Outbound XDCR docs"
c. Add Percentage Complete
d. Add XDCR Replication Rate

3. On Outgoing XDCR section
a. Remove "mutations checked" and "mutations replicated", move it at a logging level.
b. Remove "active vb reps" and "waiting vb reps" , move it to a logging level.
c. Remove "mutations in queue"
d. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR docs"
d. Rename "queue size" as "XDCR queue size"
e. Edit "num checkpoints issued ", "num checkpoints failed" to last 10 checkpoint instead of the entire set of checkpoints.
 
@Perry - The stats "secs in replicating" and "secs in checkpointing" have been useful in triaging XDCR bugs in the past.
Currently most of the XDC stats are aggregated at the ns_server/mnesia level; the individual (per-vbucket) logging is maintained at the log level. Considering the criticality of this stat, we've decided to continue maintaining this information for XDC checkpointing.

Comment by Ketaki Gangal [ 15/Jan/13 ]
Of these , these stats are most critical


1. On the XDCR tab, in addition to existing information add per Replication Setup
a. Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
b.Replication rate-Items per second to see speed of each stream and in total
c.Bandwidth in use. As per a customer, the most important thing with XDCR is going to be the possibly cross-country internet bandwidth and will need to monitor that for each replication stream and in total
Comment by Dipti Borkar [ 16/Jan/13 ]
Ketaki, sorry I couldn't attend the meeting today. I want some clarification on some of these before we implement. I'll sync up with you tomorrow.
Comment by Perry Krug [ 16/Jan/13 ]
Thank you Ketaki.

A few more comments:
-I don't know that "percentage complete" and "XDCR replication rate" is necessarily needed in the "main bucket section"...those are really specific to each stream below and may not make sense to aggregate together.
-Are we planning on keeping "mutation to replicate" and "XDCR docs to replicate" as separate stats?
-Along with above, what is the difference between (and do we need to keep all) "XDCR queue size", and "Outbound XDCR docs"?
-I still question the usefulness of the "secs in replicating" and "secs in checkpointing"...won't these values be constantly incrementing for the life of the replication stream? When looking at a customer's environment after running for days/weeks/months, what are these stats expected to show? Apologies if I'm not understanding them correctly...

Thanks
Comment by Ketaki Gangal [ 16/Jan/13 ]
@Dipti - Sure, lets sync up today on this.

@Perry -
c. Add Percentage Complete - yes, this is more pertinent at a replication stream level
d. Add XDCR Replication Rate - yes, this is more pertinent at a replication stream level

 Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR docs" , so they should be the same stats.
@Junyi - Correct me if this is a wrong assumption.

XDCR Queue size: the actual memory currently being used to store the queue (which holds a much smaller subset of all items to be replicated). We figured this would be useful to know while sizing the bucket/memory with reference to XDCR.
Outbound XDCR docs: the total number of items that are to be replicated; not all of them are in memory at all times.

For "secs in checkpointing" and "secs in replicating", I agree this is a ever-growing number and when we run into a much larger runtime , typical customer scenario, this would be a huge number. However, we ve detected issues w/ XDCR in our previous testing very easily by using these stats,for example if the secs in checkpointing is way-off , it clearly shows some badness in xdcr.

Another way to do this would mean adding logging/some information elsewhere, but the current stats at the ns_server/XDCR level show these values on a per-vbucket basis, which may or may not be very useful while triaging errors of this kind.
We can, however, have a call to discuss more if there is a better way to implement this.

Comment by Perry Krug [ 16/Jan/13 ]
Thanks for continuing the conversation Ketaki. A few more follow ons from my side:

XDCR Queue size: the actual memory currently used to store the queue (which is a much smaller subset of all items to be replicated). We figured this would be useful to know while sizing the bucket/memory with respect to XDCR.
[pk] - Can you explain a bit more about memory being taken up for xdcr? Is this source or destination? What exactly is the RAM being used for? Is it in memcached or beam.smp?

For "secs in checkpointing" and "secs in replicating", I agree this is a ever-growing number and when we run into a much larger runtime , typical customer scenario, this would be a huge number. However, we ve detected issues w/ XDCR in our previous testing very easily by using these stats,for example if the secs in checkpointing is way-off , it clearly shows some badness in xdcr.
[pk] - When you say "way off"...what do you mean? Between nodes within a cluster? Between clusters? What is the difference between the checkpointing measurement and the replicating measurement? What do you mean by "badness" specifically?

Thanks Ketaki. This is all good information for our documentation and internal information as well.

Perry
Comment by Ketaki Gangal [ 20/Jan/13 ]
Hi Perry,

[pk] - Can you explain a bit more about memory being taken up for xdcr? Is this source or destination? What exactly is the RAM being used for? Is it in memcached or beam.smp?
XDCR queue size is the total memory used for the XDCR queue per node. We want to account for the memory overhead of XDCR (we only store the key and metadata).
This is memory on the source node. It is accounted for in the beam.smp memory.

For each vb replicator, the queue is created with the following limits:
maximum number of items in the queue: BatchSize * NumWorkers * 2; by default the batch size is 500 and NumWorkers is 4, so the queue can hold at most 4000 mutations
maximum size of the queue: 100 * 1024 * NumWorkers bytes; by default that is 400KB
In short, the queue is bounded by 400KB or 4000 items, whichever is reached first.

On each node there are at most 32 active replicators, so 32 * 400KB = 12800KB = 12.8MB is the maximum memory overhead used by the queues.
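For reference, a minimal Python sketch reproducing only the arithmetic quoted above (the constant names are illustrative, not identifiers from the XDCR code):

    # Defaults as stated above; names are illustrative only.
    BATCH_SIZE = 500        # mutations per batch
    NUM_WORKERS = 4         # worker processes per vb replicator
    ACTIVE_VB_REPS = 32     # max active vb replicators per node

    max_items_per_queue = BATCH_SIZE * NUM_WORKERS * 2     # = 4000 mutations
    max_kb_per_queue = 100 * NUM_WORKERS                    # = 400 KB (100 * 1024 * NumWorkers bytes)
    per_node_queue_kb = ACTIVE_VB_REPS * max_kb_per_queue  # = 12800 KB, quoted above as 12.8 MB

    print(max_items_per_queue, max_kb_per_queue, per_node_queue_kb)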

[pk] - When you say "way off"...what do you mean? Between nodes within a cluster? Between clusters? What is the difference between the checkpointing measurement and the replicating measurement? What do you mean by "badness" specifically?
For "secs in replicating" v/s "secs in checkpointing" I am not sure of the exact difference between the two.
@Junyi - Could you explain more here?
I should ve referred the "Docs to replicate" inplace of the "secs checkpointing" which lead to significant checkpoint changes in the past - my bad. This "http://www.couchbase.com/issues/browse/MB-6939" was the one I had in mind while referring to badness.

thanks,
Ketaki
Comment by Junyi Xie (Inactive) [ 21/Jan/13 ]
This bug will spawn a list of fixes. My tentative plan is to resolve this bug by several commits, based on all discussion above.

First of all, let me make clear that the "docs" (or "items") XDCR replicates are actually "mutations". For example, if we send 10 docs via XDCR to the remote cluster, it is possible that all of them are 10 mutations of a single document (item), rather than 10 different docs (items). So, in the stats section, we should use "mutations" instead of "docs" where applicable.

Here is my summary; please let me know if you have any questions or if I missed anything.

Commit 1: Rename current stats, just renaming, no change to the underlying stats
In the MAIN bucket section:
a. Rename XDC Dest ops/sec to "Incoming XDCR ops/sec"
b. Rename "XDC docs to replicate" to "Outbound XDCR mutations"
In the Outbound XDCR stats section:
c. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR mutations"
d. Rename "queue size" as "XDCR queue size"

Commit 2: Change current stats
In the Outbound XDCR stats section:
a. Change "num checkpoints issued ", "num checkpoints failed" to last 10 checkpoint instead of the entire set of checkpoints, also rename them correspondingly


Commit 3: Add new stats
In the Outbound XDCR stats section:
a. Add a new stat "Percentage of completeness", which is computed as
"number of mutations already sent to the remote side" / ("number of mutations already sent to the remote side" + "number of mutations waiting to be sent to the remote side").
Here "number of mutations waiting to be sent to the remote side" is the stat "Outbound XDCR mutations". (A sketch of how these three stats could be derived follows this list.)

b. Add a new stat "Replication rate", the number of mutations sent per second, to see the speed of each stream. Unit: mutations/second

c. Add a new stat "Bandwidth in use", defined as the number of bytes XDCR sends on the wire. Unit: bytes/second
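A minimal sketch (hypothetical stat names, not the actual ns_server implementation) of how these three stats could be derived from two successive samples of cumulative per-replication counters:

    # prev/curr are dicts of cumulative counters sampled interval_secs apart.
    # The field names below are placeholders, not the real REST stat keys.
    def derive_stats(prev, curr, interval_secs):
        sent = curr["mutations_replicated"]          # cumulative count sent so far
        waiting = curr["outbound_xdcr_mutations"]    # current backlog
        total = sent + waiting
        percent_complete = 100.0 * sent / total if total else 100.0

        # Rates come from deltas of cumulative counters over the sampling interval.
        replication_rate = (curr["mutations_replicated"] - prev["mutations_replicated"]) / interval_secs
        bandwidth_in_use = (curr["data_replicated_bytes"] - prev["data_replicated_bytes"]) / interval_secs
        return percent_complete, replication_rate, bandwidth_in_use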



Commit 4: remove all uninteresting stats and route them to logs
In Outbound XDCR stats section:
a. Remove "mutations checked" and "mutations replicated", move it at a logging level.
b. Remove "active vb reps" and "waiting vb reps" , move it to a logging level.
c. Remove "mutations in queue", move it to a logging level.



Comment by Perry Krug [ 21/Jan/13 ]
Thanks, Junyi.

A couple quick questions/clarifications:
c. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR mutations"
[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" to be merged into one stat called "Outbound XDCR mutations" in the UI?

d. Rename "queue size" as "XDCR queue size"
[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage. Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?

Commit 3: Add new stats
[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.

Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?
Comment by Junyi Xie (Inactive) [ 21/Jan/13 ]
Perry,

[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" to be merged into one stat called "Outbound XDCR mutations" in the UI?
[jx] - Yes, this will unify these two stats. Actually they are the same stat with different names, one is at Main Section and the other is in Outbound XDCR stats. Per your comments, I will use the same name to remove the confusion.

[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage.
[jx] - This is defined in Bytes. If you move your mouse over the stat on UI, you will see the text "Size of bytes of XDC replication queue". If the data is a KB, MB, GB scale, you will see KB, MB, GB on the UI. There should not be confusion.

[pk]Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
[jx] - This 12.8MB is just the user data (docs, mutations) queued to be replicated; it is only the queue created by XDCR and does not include any other overhead. XDCR lives in the ns_server Erlang process; per node it will create 32 replicators, each replicator will create several worker processes, plus other Erlang processes at run-time, so there will be some memory overhead, which could be big, but I do not have a number at this time.

From where do you get 2GB of "extra memory"? Is it per node or per cluster?

[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
[jx] - Oh, I thought these new stats were in the Outbound XDCR section, which is graphed and per-replication. Why do we need a separate stat in a different place?



[pk] - Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?

First, both are aggregated elapsed time from each vb replicator.

"secs in checkpointing" means how much time XDCR vb replicator is working on checkpointing.
"secs in replicating measurement" means how much time XDCR vb replicator is working on replicating the mutations.

By monitoring these two stats, we can have some idea where XDCR spent the time and what XDCR is busy working on.

For these two stats, I understand they may create some confusion at customer side. As Ketaki said, these stats are still useful for QE and performance team. If customers really dislike these stats, we can remove them. :) Personally I am OK with either.


Comment by Perry Krug [ 21/Jan/13 ]
Thanks so much, Junyi.

Perry,

[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" to be merged into one stat called "Outbound XDCR mutations" in the UI?
[jx] - Yes, this will unify these two stats. Actually they are the same stat with different names, one is at Main Section and the other is in Outbound XDCR stats. Per your comments, I will use the same name to remove the confusion.
[pk] - Perfect, thank you.

[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage.
[jx] - This is defined in Bytes. If you move your mouse over the stat on UI, you will see the text "Size of bytes of XDC replication queue". If the data is a KB, MB, GB scale, you will see KB, MB, GB on the UI. There should not be confusion.
[pk] - Yes, that will be great, thanks.

[pk]Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
[jx] - This 12.8MB is just the user data (docs, mutations) queued to be replicated; it is only the queue created by XDCR and does not include any other overhead. XDCR lives in the ns_server Erlang process; per node it will create 32 replicators, each replicator will create several worker processes, plus other Erlang processes at run-time, so there will be some memory overhead, which could be big, but I do not have a number at this time.

From where do you get 2GB of "extra memory"? Is it per node or per cluster?
[pk] - This was the recommendation from QE based upon some analysis we did at Concur. Would be *extremely* helpful to get accurate and specific sizing information, and what takes up that size in whatever form.

[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
[jx] - Oh, I thought these new stats were in the Outbound XDCR section, which is graphed and per-replication. Why do we need a separate stat in a different place?
[pk] - This has to do with how and why these stats are being consumed. When a user is looking at their cluster to determine the replication status, it will be much easier to look at all the streams together...this is much harder to do when you have to click into each bucket and look at each individual stream. It's in the same line as why we have item counts on the manage servers screen.


[pk] - Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?

First, both are aggregated elapsed time from each vb replicator.

"secs in checkpointing" means how much time XDCR vb replicator is working on checkpointing.
"secs in replicating measurement" means how much time XDCR vb replicator is working on replicating the mutations.

By monitoring these two stats, we can have some idea where XDCR spent the time and what XDCR is busy working on.

For these two stats, I understand they may create some confusion at customer side. As Ketaki said, these stats are still useful for QE and performance team. If customers really dislike these stats, we can remove them. :) Personally I am OK with either.
[pk] - Thanks for the explanation. I would still advocate for removing them. The main reason being that they do not materially help identify any issue or behavior after the cluster has been running for an extended period of time. The up-to-the-second monitoring of these stats will show an extremely high number for both after just a few days or a week of a replication stream running...let alone multiple weeks or months. I can definitely see that they would be useful when debugging the initial stream or trying to identify an issue, but I would ask that they be moved to the log or other stat area outside of the UI.

Which leads me to another question :-) Do we have documentation already (or can you help with that) on where and how to get these "other" stats regarding XDCR? Is there only a REST API to query? Are they printed into some log periodically? Could we get that detailed and written up?

Thanks again!
Comment by Junyi Xie (Inactive) [ 22/Jan/13 ]
Perry, you are highly welcome. Please see my response below.

[pk] - This has to do with how and why these stats are being consumed. When a user is looking at their cluster to determine the replication status, it will be much easier to look at all the streams together...this is much harder to do when you have to click into each bucket and look at each individual stream. It's in the same line as why we have item counts on the manage servers screen.
[jx] -- I see. Thanks for the explanation. I agree that from the user's perspective it is better to have a summary stat across ALL replications, not just per replication stream.
Today it seems we do not have anything like this (stats across all buckets), and there is no stat on the XDCR tab either, so I need to talk to the UI folks about how and where to add these stats. It involves some UI design change and is more than adding another per-replication stat to the UI. Better to


[pk] -- Which leads me to another question :-) Do we have documented already (or can you help with that) where and how to get these "other" stats regarding XDCR? Is there only a REST API to query? Are they printed into some log periodically? Could we get that detailed and written up?
[jx] -- Other than the UI stats, XDCR also dumps a lot of stats and information to log files, but I am afraid they are too detailed and hard to parse from the customer's perspective :) Today all XDCR stats are on the UI. Tomorrow, after we remove some stats from the UI (like secs in checkpointing), I will put them into the log and document how to get them easily. For all stats on the UI, you can use the standard REST API to get them.

Comment by Junyi Xie (Inactive) [ 24/Jan/13 ]
Thanks everybody for the discussion. I will start working on Commits 1 and 2, which we all agree on.
Comment by Junyi Xie (Inactive) [ 24/Jan/13 ]
Commit 1: http://review.couchbase.org/#/c/24189/

Comment by Junyi Xie (Inactive) [ 28/Jan/13 ]
Commit 2: http://review.couchbase.org/#/c/24251/
Comment by Junyi Xie (Inactive) [ 12/Feb/13 ]
Commit 3: add latency stats

http://review.couchbase.org/#/c/24399/
Comment by Perry Krug [ 14/Feb/13 ]
Thanks Junyi. Do we have a bug open already for the UI enhancements around this?
Comment by Junyi Xie (Inactive) [ 14/Feb/13 ]
I mean you can open another bug for the bandwidth usage, which is purely UI work and has nothing to do with XDCR code.

For this particular bug, MB-7432, all work on the XDCR side is done except the stats removal (Dipti will make the decision on that; she will probably file another bug). So please close this bug if you do not need anything else from me.
Comment by Perry Krug [ 14/Feb/13 ]
So it sounds like this is not yet resolved if all the decisions haven't been made yet.

Assigning to Dipti to make the final decisions...I want to leave it open to make sure things get wrapped up.

Adding a UI component for the bandwidth request.
Comment by Maria McDuff (Inactive) [ 25/Mar/13 ]
deferred out of 2.0.2
Comment by Perry Krug [ 22/Oct/13 ]
Can we revisit this for 2.5 or 3.0? All that remains is a decision on removing some of the statistics. I still feel many of them are misleading or confusing to the user and should be moved to something more internal if still needed by our dev/QE.
Comment by Junyi Xie (Inactive) [ 22/Oct/13 ]
Anil,

I think it would be nice if you could call a meeting with Perry and me to discuss which XDCR stats should be removed. We do not want to remove stats which are still useful, or remove them only to have to re-add them later.
Comment by Perry Krug [ 24/Oct/13 ]
Just adding the comment from 9218:

Putting "outbound XDCR mutations" on one side and "incoming XDCR mutations" on the other side makes the two seem very related. Perhaps "outbound XDCR mutations" should be "XDCR backlog" to make it clearer that it is not a rate and should not match the number on the other side.
Comment by Perry Krug [ 29/Oct/13 ]
To summarize the conversation and provide next steps:

My primary goal here is to provide meaningful and "actionable" statistics to our customers. I recognize that there may be various other stats that are useful for testing and development, but not necessarily for the end customer. The determining factor in my mind is whether we can explain "what to do" when a particular number is high or low. If we do not have that, then I suggest the statistic does not need to be displayed in the UI. Much the same way we do not expose the 300 statistics available with cbstats, I think the same logic should be applied here.

So...my requests are:
-Change "outbound XDCR mutations" to "XDCR backlog" to indicate that this is the number of mutations within the source cluster that have not yet been replicated to the destination. This stat is shown both in the "summary" as well as the per-stream "outbound xdcr operations" sections
-Change "mutations replicated optimistically" from an incrementing counter to a "per second" rate
-Remove from "outbound xdcr operations" sections:
   -mutations checked*
   -mutations replicated*
   -data replicated*
   -active vb reps±
   -waiting vb reps±
   -secs in replicating*
   -secs in checkpointing*
   -checkpoints issued±
   -checkpoints failed±
   -mutations in queue~
   -XDCR queue size~

To provide some more explanation:
(*) - These stats are constantly incrementing and therefore, after weeks/months, are not useful for describing any behavior or problem
(±) - These stats are internal implementation details, and also do not signal to the user that they should take specific action
(~) - These stats are "bounded parameters" and therefore should never be higher than what the parameter is set to. Even if they are higher or lower, we don't have a recommendation on "what to do" back to the customer


The stats I am suggesting to remove should still be available via the REST API, but I think they are not as useful in the UI. In the field, we sometimes need to explain not only what each stat means, but "what to do" based upon the value of these statistics. I don't feel that these statistics represent something the customer needs to be concerned about or act on.
Comment by Cihan Biyikoglu [ 11/Mar/14 ]
We will consider the feedback, but UPR work has priority and we are the long pole for the release. Moving to backlog. Assigning to myself.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
[Minor edit on ordering...new layout below]

Discussed the stats layout with Anil and Perry yesterday. Below is Anil's capture of the result of that conversation. I may add some more stats to this, however (I'm thinking about %utilization, which might be quite useful and easily doable).

Hi Alk,

Here is what we discussed on XDCR stats -
First row
Outbound XDCR mutations
Percent completed
Active vb reps
Waiting vb reps

Second row
Mutation replication rate
Data replication rate
Mutation replicated optimistically rate
Mutations checked rate

Third row
Meta ops latency
Doc ops latency
New stats
New stats

Thanks!




[MB-11820] beer-sample loading is stuck in crashed state (was: Rebalance not available 'pending add rebalanace', beer-sample loading is stuck) Created: 25/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Anil Kumar Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Yes

 Description   
4 node cluster
scenario

1 - Created beer-sample right after creation of the cluster
2 - Right after bucket started loading, auto-generated load started running on it
3 - After many many minutes, I added a few nodes and noticed that I couldn't rebalance. Digging in further, I saw that the beer-sample loading was still going on but not making any progress.

Logs are at:
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.196.74.148.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.196.87.131.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.0.243.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.21.69.zip
https://s3.amazonaws.com/cb-customers/perry/11616/collectinfo-2014-07-25T010405-ns_1%4010.198.22.57.zip

 Comments   
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Converting this ticket to "beer-sample loading is stuck". The lack of a rebalance warning is covered by another existing ticket that is still in the works.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Here's what I have in my logs that's output from docloader:

5 matches for "output from beer-sample" in buffer: ns_server.debug.log
  19416:[ns_server:debug,2014-07-24T23:40:07.637,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "[2014-07-24 23:40:07,637] - [rest_client] [47987464387312] - INFO - existing buckets : [u'beer-sample']\n"
  19417:[ns_server:debug,2014-07-24T23:40:07.637,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "[2014-07-24 23:40:07,637] - [rest_client] [47987464387312] - INFO - found bucket beer-sample\n"
  19450:[ns_server:debug,2014-07-24T23:40:10.387,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: "Traceback (most recent call last):\n File \"/opt/couchbase/lib/python/cbdocloader\", line 241, in ?\n main()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 233, in main\n"
  19451:[ns_server:debug,2014-07-24T23:40:10.388,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: " docloader.populate_docs()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 191, in populate_docs\n self.unzip_file_and_upload()\n File \"/opt/couchbase/lib/python/cbdocloader\", line 175, in unzip_file_and_upload\n self.enumerate_and_save(working_dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 165, in enumerate_and_save\n self.enumerate_and_save(dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 165, in enumerate_and_save\n self.enumerate_and_save(dir)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 155, in enumerate_and_save\n self.save_doc(dockey, fp)\n File \"/opt/couchbase/lib/python/cbdocloader\", line 133, in save_doc\n self.bucket.set(dockey, 0, 0, raw_data)\n File \"/opt/couchbase/lib/python/couchbase/client.py\", line 232, in set\n self.mc_client.set(key, expiration, flags, value)\n File \"/opt/couchbase/lib/python/couchbase/couchbaseclient.py\", line 927, in set\n"
  19452:[ns_server:debug,2014-07-24T23:40:10.388,ns_1@10.198.22.57:<0.831.0>:samples_loader_tasks:wait_for_exit:99]output from beer-sample: " return self._respond(item, event)\n File \"/opt/couchbase/lib/python/couchbase/couchbaseclient.py\", line 883, in _respond\n raise item[\"response\"][\"error\"]\ncouchbase.couchbaseclient.MemcachedError: Memcached error #134: Temporary failure\n"

I don't know if docloader is truly stuck or if it is retrying and getting temporary-failure errors all the time.
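For illustration only, a minimal Python sketch (a hypothetical helper, not docloader's actual code) of what retry-with-backoff on a temporary failure would look like, so a transient error #134 does not present as a permanent hang:

    import time

    def set_with_retry(bucket, key, value, max_retries=10, delay=0.5):
        # bucket is assumed to expose the same set(key, exp, flags, value) call
        # seen in the traceback above.
        for attempt in range(max_retries):
            try:
                bucket.set(key, 0, 0, value)
                return True
            except Exception as err:
                if "Temporary failure" not in str(err):
                    raise                            # not a temp failure: propagate
                time.sleep(delay * (2 ** attempt))   # back off before retrying
        return False                                 # still failing: report instead of spinning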
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
I don't know who owns docloader, but AFAIK it was Bin. I've also heard about some attempts to rewrite it in Go.

CC-ed a bunch of possibly related folks.




[MB-7250] Mac OS X App should be signed by a valid developer key Created: 22/Nov/12  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.0-beta-2, 2.1.0, 2.2.0
Fix Version/s: 3.0
Security Level: Public

Type: Improvement Priority: Blocker
Reporter: J Chris Anderson Assignee: Wayne Siu
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Build_2.5.0-950.png     PNG File Screen Shot 2013-02-17 at 9.17.16 PM.png     PNG File Screen Shot 2013-04-04 at 3.57.41 PM.png     PNG File Screen Shot 2013-08-22 at 6.12.00 PM.png     PNG File ss_2013-04-03_at_1.06.39 PM.png    
Issue Links:
Dependency
depends on MB-9437 macosx installer package fails during... Closed
Relates to
relates to CBLT-104 Enable Mac developer signing on Mac b... Open

 Description   
Currently launching the Mac OS X version tells you it's from an unidentified developer. You have to right click to launch the app. We can fix this.

 Comments   
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
Chris,

do you know what needs to change on the build machine to embed our developer key?
Comment by J Chris Anderson [ 22/Nov/12 ]
I have no idea. I could start researching how to get a key from Apple but maybe after the weekend. :)
Comment by Farshid Ghods (Inactive) [ 22/Nov/12 ]
we can discuss this next week : ) . Thanks for reporting the issue Chris.
Comment by Steve Yen [ 26/Nov/12 ]
we'll want separate, related bugs (tasks) for other platforms, too (windows, linux)
Comment by Jens Alfke [ 30/Nov/12 ]
We need to get a developer ID from Apple; this will give us some kind of cert, and a local private key for signing.
Then we need to figure out how to get that key and cert onto the build machine, in the Keychain of the account that runs the buildbot.
Comment by Farshid Ghods (Inactive) [ 02/Jan/13 ]
the instructions to build is available here :
https://github.com/couchbase/couchdbx-app
we need to add codesign as a build step there
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Phil,

do you have any update on this ticket. ?
Comment by Phil Labee [ 22/Jan/13 ]
I have signing cert installed on 10.17.21.150 (MacBuild).

Change to Makefile: http://review.couchbase.org/#/c/24149/
Comment by Phil Labee [ 23/Jan/13 ]
need to change master.cfg and pass env.var. to package-mac
Comment by Phil Labee [ 29/Jan/13 ]
disregard previous. Have added signing to Xcode projects.

see http://review.couchbase.org/#/c/24273/
Comment by Phil Labee [ 31/Jan/13 ]
To test this go to System Preferences / Security & Privacy, and on the General tab set "Allow applications downloaded from" to "Mac App Store and Identified Developers". Set this before running Couchbase Server.app the first time. Once an app has been allowed to run this setting is no longer checked for that app, and there doesn't seem to be a way to reset that.

What is odd is that on my system, I allowed one unsigned build to run before restricting the app-run setting, and then no other unsigned builds were checked (they were all allowed to run). Either there is a flaw in my testing methodology, or there is a serious weakness in this security setting: the fact that one app called Couchbase Server was allowed to run should not confer this privilege on other apps with the same name. A common malware tactic is to modify a trusted app and distribute it as an update, and if the security setting keys off the app name it will do nothing to prevent that.

I'm approving this change without having satisfactorily tested it.
Comment by Jens Alfke [ 31/Jan/13 ]
Strictly speaking it's not the app name but its bundle ID, i.e. "com.couchbase.CouchbaseServer" or whatever we use.

> I allowed one unsigned build to run before restricting the app run setting, and then no other unsigned builds would be checked

By OK'ing an unsigned app you're basically agreeing to toss security out the window, at least for that app. This feature is really just a workaround for older apps. By OK'ing the app you're not really saying "yes, I trust this build of this app" so much as "yes, I agree to run this app even though I don't trust it".

> A common malware tactic is to modify a trusted app and distribute it as update

If it's a trusted app it's hopefully been signed, so the user wouldn't have had to waive signature checking for it.
Comment by Jens Alfke [ 31/Jan/13 ]
Further thought: It might be a good idea to change the bundle ID in the new signed version of the app, because users of 2.0 with strict security settings have presumably already bypassed security on the unsigned version.
Comment by Jin Lim [ 04/Feb/13 ]
Per bug scrubs, keep this a blocker since customers ran into this issue (and originally reported it).
Comment by Phil Labee [ 06/Feb/13 ]
revert the change so that builds can complete. App is currently not being signed.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
i suggest for 2.0.1 release we do this build manually.
Comment by Jin Lim [ 11/Feb/13 ]
As one-off fix, add the signature manually and automate the required steps later in 2.0.2 or beyond.
Comment by Jin Lim [ 13/Feb/13 ]
Please move this bug to 2.0.2 after populating the required signature manually. I am lowering the severity to critical since it is no longer a blocking issue.
Comment by Farshid Ghods (Inactive) [ 15/Feb/13 ]
Phil to upload the binary to latestbuilds , ( 2.0.1-101-rel.zip )
Comment by Phil Labee [ 15/Feb/13 ]
Please verify:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee [ 15/Feb/13 ]
uploaded:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip

I can rename it when uploading for release.
Comment by Farshid Ghods (Inactive) [ 17/Feb/13 ]
I still get the error that it is from an unidentified developer.

Comment by Phil Labee [ 18/Feb/13 ]
operator error.

I rebuilt the app, this time verifying that the codesign step occurred.

Uploaded the new file to the same location:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-160-rel-signed.zip
Comment by Phil Labee [ 26/Feb/13 ]
still need to perform manual workaround
Comment by Phil Labee [ 04/Mar/13 ]
release candidate has been uploaded to:

http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip
Comment by Wayne Siu [ 03/Apr/13 ]
Phil, looks like version 172/185 is still getting the error. My Mac version is 10.8.2
Comment by Thuan Nguyen [ 03/Apr/13 ]
Installed Couchbase Server (build 2.0.1-172, community version) on my Mac OS X 10.7.4; I only see the warning message.
Comment by Wayne Siu [ 03/Apr/13 ]
Latest version (04.03.13) : http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.zip
Comment by Maria McDuff (Inactive) [ 03/Apr/13 ]
Works in 10.7 but not in 10.8.
If we can get the fix for 10.8 by tomorrow, end of day, QE is willing to test for release on Tuesday, April 9.
Comment by Phil Labee [ 04/Apr/13 ]
The mac builds are not being automatically signed, so build 185 is not signed. The original 172 is also not signed.

Did you try

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-172-signed.zip

to see if that was signed correctly?

Comment by Wayne Siu [ 04/Apr/13 ]
Phil,
Yes, we did try the 172-signed version. It works on 10.7 but not 10.8. Can you take a look?
Comment by Phil Labee [ 04/Apr/13 ]
I rebuilt 2.0.1-185 and uploaded a signed app to:

    http://packages.northscale.com/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.SIGNED.zip

Test on a machine that has never had Couchbase Server installed, and has the security setting to only allow Appstore or signed apps.

If you get the "Couchbase Server.app was downloaded from the internet" warning and you can click OK and install it, then this bug is fixed. The quarantining of files downloaded by a browser is part of the operating system and is not controlled by signing.
Comment by Wayne Siu [ 04/Apr/13 ]
Tried the 185-signed version (see attached screen shot). Same error message.
Comment by Phil Labee [ 04/Apr/13 ]
This is not an error message related to this bug.

Comment by Maria McDuff (Inactive) [ 14/May/13 ]
Per bug triage, we need to have Mac OS X 10.8 working since it is a supported platform (published on the website).
Comment by Wayne Siu [ 29/May/13 ]
Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Anil Kumar [ 31/May/13 ]
We need to address the signing key for both Windows and Mac; deferring this to the next release.
Comment by Dipti Borkar [ 08/Aug/13 ]
Please let's make sure this is fixed in 2.2.
Comment by Phil Labee [ 16/Aug/13 ]
New keys will be created using new account.
Comment by Phil Labee [ 20/Aug/13 ]
iOS Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=iOS Distribution expires Aug 12, 2014

    ~buildbot/Desktop/appledeveloper.couchbase.com/certs/ios/ios_distribution_appledeveloper.couchbase.com.cer

Identifiers:
  App IDS:
    "Couchbase Server" id=com.couchbase.*

Provisioning Profiles:
  Distribution:
    "appledeveloper.couchbase.com" type=Distribution

  ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/ios/appledevelopercouchbasecom.mobileprovision
Comment by Phil Labee [ 20/Aug/13 ]
Mac Apps
--------------
Certificates:
  Production:
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)
    "Couchbase, Inc." type=Developer ID installer (Aug,16,2014)
    "Couchbase, Inc." type=Developer ID Application (Aug,16,2014)
    "Couchbase, Inc." type=Mac App Distribution (Aug,15,2014)

     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developerID_installer.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/developererID_application.cer
     ~buildbot/Desktop/appledeveloper.couchbase.com/certs/mac_app/mac_app_distribution-2.cer

Identifiers:
  App IDs:
    "Couchbase Server" id=couchbase.com.* Prefix=N2Q372V7W2
    "Coucbase Server adhoc" id=couchbase.com.* Prefix=N2Q372V7W2
    .

Provisioning Profiles:
  Distribution:
    "appstore.couchbase.com" type=Distribution
    "Couchbase Server adhoc" type=Distribution

     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/appstorecouchbasecom.privisioningprofile
     ~buildbot/Desktop/appledeveloper.couchbase.com/profiles/Couchbase_Server_adhoc.privisioningprofile

Comment by Phil Labee [ 21/Aug/13 ]

As of build 2.2.0-806 the app is signed by a new provisioning profile
Comment by Phil Labee [ 22/Aug/13 ]
Install version 2.2.0-806 on a Mac OS X 10.8 machine that has never had Couchbase Server installed and which has the security setting requiring applications to be signed with a Developer ID.
Comment by Phil Labee [ 22/Aug/13 ]
please assign to tester
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
just tried this against newest build 809:
still getting restriction message. see attached.
Comment by Maria McDuff (Inactive) [ 22/Aug/13 ]
restriction still exists.
Comment by Maria McDuff (Inactive) [ 28/Aug/13 ]
verified in rc1 (build 817). still not fixed. getting same msg:
“Couchbase Server” can’t be opened because it is from an unidentified developer.
Your security preferences allow installation of only apps from the Mac App Store and identified developers.

Work Around:
Step One
Hold down the Control key and click the application icon. From the contextual menu choose Open.

Step Two
A popup will appear asking you to confirm this action. Click the Open button.
Comment by Phil Labee [ 03/Sep/13 ]
Need to create new certificates to replace these that were revoked:

Certificate: Mac Development
Team Name: Couchbase, Inc.

Certificate: Mac Installer Distribution
Team Name: Couchbase, Inc.

Certificate: iOS Development
Team Name: Couchbase, Inc.

Certificate: iOS Distribution
Team Name: Couchbase, Inc.
Comment by Maria McDuff (Inactive) [ 18/Sep/13 ]
candidate for 2.2.1 bug fix release.
Comment by Dipti Borkar [ 28/Oct/13 ]
Is this going to make it into 2.5? We seem to keep deferring it.
Comment by Phil Labee [ 29/Oct/13 ]
cannot test changes with installer that fails
Comment by Phil Labee [ 11/Nov/13 ]
Installed certs as buildbot and signed app with "(recommended) 3rd Party Mac Developer Application", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-001.zip

Signed with "(Oct 30) 3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)", producing

    http://factory.hq.couchbase.com//couchbase_server_2.5.0_MB-7250-002.zip

These zip files were made on the command line, not as a result of the make command. They are 2.5G in size, so they obviously include more than the zip files produced by the make command.

Both versions of the app appear to be signed correctly!

Note: cannot run make command from ssh session. Must Remote Desktop in and use terminal shell natively.
Comment by Phil Labee [ 11/Nov/13 ]
Finally, some progress: if the zip file is made using the --symlinks argument, the app appears to be unsigned. If the symlinked files are included as full copies, the app appears to be signed correctly.

The zip file with symlinks is 60M, while the zip file with copies of the files is 2.5G, more than 40X the size.
Comment by Phil Labee [ 25/Nov/13 ]
Fixed in 2.5.0-950
Comment by Dipti Borkar [ 25/Nov/13 ]
Maria, can QE please verify this?
Comment by Wayne Siu [ 28/Nov/13 ]
Tested with build 2.5.0-950. Still see the warning box (attached).
Comment by Wayne Siu [ 19/Dec/13 ]
Phil,
Can you give an update on this?
Comment by Ashvinder Singh [ 14/Jan/14 ]
I tested the code signature with the Apple utility "spctl -a -v /Applications/Couchbase\ Server.app/" and got the output:
>>> /Applications/Couchbase Server.app/: a sealed resource is missing or invalid

also tried running the command:
 
bash: codesign -dvvvv /Applications/Couchbase\ Server.app
>>>
Executable=/Applications/Couchbase Server.app/Contents/MacOS/Couchbase Server
Identifier=com.couchbase.couchbase-server
Format=bundle with Mach-O thin (x86_64)
CodeDirectory v=20100 size=639 flags=0x0(none) hashes=23+5 location=embedded
Hash type=sha1 size=20
CDHash=868e4659f4511facdf175b44a950b487fa790dc4
Signature size=4355
Authority=3rd Party Mac Developer Application: Couchbase, Inc. (N2Q372V7W2)
Authority=Apple Worldwide Developer Relations Certification Authority
Authority=Apple Root CA
Signed Time=Jan 8, 2014, 10:59:16 AM
Info.plist entries=31
Sealed Resources version=1 rules=4 files=5723
Internal requirements count=1 size=216

It looks like the code signature is present but became invalid as new files were added/modified in the project. I suggest the build team rebuild and re-sign the app.
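For reference, a minimal Python sketch (illustrative only) that scripts the same two checks run by hand above, so a build step could fail fast on an invalid signature:

    import subprocess

    APP = "/Applications/Couchbase Server.app"

    def signature_ok(app_path):
        # Gatekeeper assessment; non-zero exit means the app would be blocked.
        gatekeeper = subprocess.call(["spctl", "-a", "-v", app_path])
        # Signature check; catches "a sealed resource is missing or invalid".
        codesign = subprocess.call(["codesign", "--verify", "--verbose", app_path])
        return gatekeeper == 0 and codesign == 0

    if __name__ == "__main__":
        print("signed and accepted" if signature_ok(APP) else "signature invalid or rejected")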
Comment by Phil Labee [ 17/Apr/14 ]
need VM to clone for developer experimentation
Comment by Anil Kumar [ 18/Jul/14 ]
Any update on this? We need this for 3.0.0 GA.

Please update the ticket.

Triage - July 18th




[MB-11808] GeoSpatial in 3.0 Created: 24/Jul/14  Updated: 25/Jul/14

Status: Open
Project: Couchbase Server
Component/s: ns_server, UI, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sriram Melkote Assignee: Volker Mische
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We must hide GeoSpatial related UI elements in 3.0 release, as we have not completed the task of moving GeoSpatial features over to UPR.

We should use the simplest way to hide elements (like "display:none" attribute) because we fully expect to resurface this in 3.0.1


 Comments   
Comment by Sriram Melkote [ 24/Jul/14 ]
In the 3.0 release meeting, it was fairly clear that we won't be able to add Geo support for 3.0 due to the release being in Beta phase now and heading to code freeze soon. So, we should plan for it in 3.0.1 - updating description to reflect this.




[MB-10685] XDCR Stats: Negative values seen for mutation replication rate and data replication rate Created: 28/Mar/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-03-28 at 11.51.35 AM.png     PNG File Screen Shot 2014-03-28 at 11.51.54 AM.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Seen in many scenarios where data outflow dips to 0 temporarily.

Negative mutation replication rate measured in number of mutations/sec and data replication rate measured in B/KB do not make sense. Should be 0 or positive.

 Comments   
Comment by Pavel Paulau [ 23/Apr/14 ]
Saw negative "outbound XDCR mutations" as well.

Can't agree with Minor status.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
MB-7432 takes this into account already




[MB-9707] users may see incorrect "Outbound mutations" stat after topology change at source cluster (was: Rebalance in/out operation on Source cluster caused outbound replication mutations != 0 for long time while no write operation on source cluster) Created: 10/Dec/13  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication, test-execution
Affects Version/s: 2.5.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Sangharsh Agarwal Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.5.0 build 991

Attachments: File MB-9707-test_log.rtf     PNG File outboundmutations.png     PNG File Screen Shot 2014-01-22 at 11.36.06 AM.png     PNG File Snap-shot-2.png    
Issue Links:
Duplicate
duplicates MB-9745 When XDCR streams encounter any error... Closed
Relates to
relates to MB-9960 2.5 Release Note: users may see incor... Resolved
Triage: Triaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: Maria, please check the title and my last comment, let me know if you need anything from me

 Description   
[Test case]
./testrunner -i ./xdcr.1.ini -t xdcr.rebalanceXDCR.Rebalance.swap_rebalance_out_master,items=1000,rdirection=unidirection,ctopology=chain,doc-ops=update-delete,rebalance=source

[Test Exception]
======================================================================
FAIL: swap_rebalance_out_master (xdcr.rebalanceXDCR.Rebalance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/xdcr/rebalanceXDCR.py", line 328, in swap_rebalance_out_master
    elif self._replication_direction_str in "bidirection":
  File "pytests/xdcr/xdcrbasetests.py", line 714, in verify_results
    else:
  File "pytests/xdcr/xdcrbasetests.py", line 683, in verify_xdcr_stats
    timeout = max(120, end_time - time.time())
  File "pytests/xdcr/xdcrbasetests.py", line 661, in __wait_for_mutation_to_replicate
AssertionError: Timeout occurs while waiting for mutations to be replicated

----------------------------------------------------------------------

[Test Steps]

1. Create 2-2 nodes Source and Destination clusters.
2. Create default bucket on both the clusters.
3. Setup CAPI mode XDCR from source-destination.
4. Load 1000 items on source cluster.
5. Do swap-rebalance master node on source cluster.
6. After rebalance is finished, wait for replication_changes_left to reach 0 on the source side. --> Test failed here; replication_changes_left always stays at 1 on the source cluster.
7. Verify items.

[Bug description]
Outbound replication mutations don't go to 0 after rebalance.


 Comments   
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
[Bug description]
Outbound replication mutations (the replication_changes_left stat) don't go to 0 after rebalance; the stat stays non-zero. A snapshot of the outbound stats on the source side is attached.
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
[Additional information]
There were also meta read operations on the destination side during the rebalance in/out operations.
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
There are two issues to be investigated:

1. Outbound XDCR replication is left at 1, while it should eventually reach zero. -> Major issue.
2. During the rebalance in/out operation on the source, why were there meta reads on the destination side when no mutations took place on the source cluster except intra-cluster rebalancing?
Comment by Junyi Xie (Inactive) [ 10/Dec/13 ]
First, what build are you using?

2 is expected. For 1, what is your timeout? Recently we checked in code to make the replicator wait 30 seconds before making a second try if an error happened. That may delay replication in some test cases.

I would suggest redoing the test manually and seeing how long it takes for the remaining item to be flushed. Also, 1K items seems a bit too small for general testing; 100K-500K makes more sense to me.
Comment by Sangharsh Agarwal [ 10/Dec/13 ]
Build 2.5.0 991.

For 2, please explain briefly why there are meta reads on the destination side.

For 1, the timeout is almost 20 minutes (there are several checks for ep_queue, curr_items, vb_active_times and then replication_changes_left). Here the outbound replication mutations stopped at 1; I waited 5-10 more minutes but it did not come back to zero (we can see the straight line in the graph), and there were no other operations running in parallel, e.g. get, set, rebalance, etc. An additional point is that this problem occurs only in the rebalancing case, not with any other operation.

In the verification steps we do the following:

1. First we wait for ep_queue to be 0 (750 seconds).
2. Then we wait for curr_items and vb_active_times to be as expected (750 seconds minus the time spent in step 1).
3. We wait for replication_changes_left to reach 0, but it is stuck at 1 (180 second timeout). (A polling sketch for this check appears after this comment.)

Some test with more than 1K items also failed because of this. I will do the test manually also tomorrow.
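For illustration, a minimal Python sketch (a hypothetical helper, not the testrunner's code) of the step-3 polling described above; it reads the standard per-bucket stats endpoint, and the exact stat key and response shape should be treated as assumptions:

    import json, time, urllib.request

    def wait_for_changes_left_zero(host, bucket, user, password, timeout=180):
        url = "http://%s:8091/pools/default/buckets/%s/stats" % (host, bucket)
        mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        mgr.add_password(None, url, user, password)
        opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
        deadline = time.time() + timeout
        while time.time() < deadline:
            samples = json.load(opener.open(url))["op"]["samples"]
            changes_left = samples.get("replication_changes_left", [None])[-1]
            if changes_left == 0:
                return True
            time.sleep(5)          # poll every few seconds until the timeout
        return False               # stuck at a non-zero value, as observed in this bug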
Comment by Junyi Xie (Inactive) [ 11/Dec/13 ]
For 2, rebalance (topology change) causes vb migration; the vb replicator will start on the new home node and check whether there is anything to replicate. That scanning process will trigger some traffic to the destination side. These are pure getMeta ops. If you have already replicated everything, no data will be replicated after rebalance.
Comment by Sangharsh Agarwal [ 13/Dec/13 ]
But it re-occurred on build 999. I added 8 minutes of waiting for mutations to reach 0, but it timed out this time as well.

Attaching a snapshot of another execution, where it got stuck at 16 outbound mutations.
Comment by Sangharsh Agarwal [ 13/Dec/13 ]
Increasing the severity to blocker as many tests are failing because of this issue.
Comment by Sangharsh Agarwal [ 13/Dec/13 ]
Test is failing with 10K items also.
Comment by Junyi Xie (Inactive) [ 14/Dec/13 ]
I made a toybuild with tentative fixes. Please retest with http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-master-toy-junyi-x86_64_0.0.0-MB9663A-toy.rpm

Please update with results and logs

Thanks,
Comment by Sangharsh Agarwal [ 16/Dec/13 ]
Junyi,
   Is this fix also merged?
Comment by Sangharsh Agarwal [ 17/Dec/13 ]
I am still getting this error after installing this toy build.

[Source Cluster]
10.3.4.176 (Master node) -> https://s3.amazonaws.com/bugdb/jira/MB-9707/c5236067/mb9707_repro_2.zip
10.3.2.161 -> https://s3.amazonaws.com/bugdb/jira/MB-9707/a30d2f16/mb9707_repro_2.zip
172.23.106.21 -> New node -> https://s3.amazonaws.com/bugdb/jira/MB-9707/b1b5e88a/mb9707_repro_2.zip

[Destination Cluster]
10.3.4.175 (Master node) -> https://s3.amazonaws.com/bugdb/jira/MB-9707/186497be/mb9707_repro_2.zip
172.23.106.22 -> https://s3.amazonaws.com/bugdb/jira/MB-9707/f5dd0a9b/mb9707_repro_2.zip


Replication was created at below time

[user:info,2013-12-17T2:31:52.064,ns_1@10.3.4.176:<0.25673.7>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.

Please analyse the logs around this time.

Note: Just to add, two of the servers have their clocks delayed by 1 minute 30 seconds relative to the others.
Comment by Junyi Xie (Inactive) [ 17/Dec/13 ]
Hi Sangharsh,

A few things

1) Inconsistent clocks cause some confusion when reading the logs. Please fix them if you can; otherwise please state clearly which node has the delay. In the last test, source nodes 10.3.4.176 and 10.3.2.161 apparently have different clocks, but for 172.23.106.21 it is not clear which other node it is consistent with.
2) Do you know which source node has remaining mutations not replicated? You can see that from the UI, but it is not shown in the uploaded screenshots. From the logs, there is no error on 10.3.4.176 and 172.23.106.21, but 10.3.2.161 has some db_not_found errors from when the replication was created. Usually that is because you create the XDCR right after the bucket is created on the destination side. To avoid that, please wait 30 seconds after you create the buckets and before starting XDCR. These errors caused vb replicators to crash and restart, and it is known that the restart is not uniform (MB-9745).
3) Did you get chance to run the test manually and reproduce?


To speed up the process, I would like to have a meeting with you and run the test so we can monitor it together. Please feel free to schedule a meeting tomorrow (Wednesday, Dec 18th) at your convenience; I guess 10AM EST works for both of us. Thanks.
Comment by Sangharsh Agarwal [ 17/Dec/13 ]
Junyi,
The same problem is occurring in a few tests in the Jenkins job as well. Please see http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/18/consoleFull.

>2) Do you know which source node has remaining mutations not replicated?

It is on 10.3.2.161; 10.3.4.176 is swap-rebalanced with 172.23.106.21.

> Usually that is because you create the XDCR right after bucket is created on destination side. To avoid that, please wait for 30 seconds after you create buckets and before start XDCR. These errors triggered vb replicator crashed and restarted, it is known that restart is not uniform (MB-9745).

The problem occurs with the updating mutations (deletes and updates) while data is being replicated, before the swap rebalance is started.

I have sent you the invite for the discussion.

Comment by Junyi Xie (Inactive) [ 18/Dec/13 ]
Re-ran the test with Sangharsh using the toybuild but did not reproduce. It does not sound like a code bug but rather a test issue. Two things need to be fixed in the test: 1) after the remote bucket is created, wait 30 seconds before creating XDCR (db_not_found was seen in the test even before rebalance); 2) reduce the vb replicator restart interval from the default 30 seconds to 5 seconds to speed up data sync-up.

BTW, all fixes in toybuild have been merged.
Comment by Sangharsh Agarwal [ 18/Dec/13 ]
Junyi,
  Currently all XDCR tests are running with "xdcrFailureRestartInterval" = 1 second. Is it OK?
Comment by Junyi Xie (Inactive) [ 19/Dec/13 ]
Hi Sangharsh,

It really depends on what test you are running; that is the reason we have this parameter. For example, for a test with small writes (like the PayPal use case) and no topology change on the destination side, it is OK to use a small restart interval.
But in other cases, like tests involving a long rebalance at the destination, it does not make sense to restart every 1 second. So my suggestion is:
1) understand the test
2) manually run the test beforehand to figure out the reasonable restart interval
3) modify automated test accordingly
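For illustration, a minimal Python sketch of adjusting this interval from a test harness, assuming the 2.x /internalSettings REST endpoint accepts xdcrFailureRestartInterval as the XDCR advanced-settings documentation describes (treat the endpoint and parameter name as assumptions):

    import urllib.parse, urllib.request

    def set_xdcr_restart_interval(host, user, password, seconds):
        url = "http://%s:8091/internalSettings" % host
        # POST of a form-encoded key=value updates the cluster-wide internal setting.
        data = urllib.parse.urlencode({"xdcrFailureRestartInterval": seconds}).encode()
        mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        mgr.add_password(None, url, user, password)
        opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
        return opener.open(urllib.request.Request(url, data=data)).read()

    # e.g. set_xdcr_restart_interval("10.3.4.176", "Administrator", "password", 5)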
Comment by Junyi Xie (Inactive) [ 19/Dec/13 ]
Sangharsh,

1) please upgrade your build to latest, and
2) send me the ini file you use, I will run the test myself.

Comment by Junyi Xie (Inactive) [ 19/Dec/13 ]
Hi Sangharsh,

I tried the test with my own ini file on 5 VM nodes. The test passes without any problem (see part of the log below). I used build 1013. Not sure what happened for you.

Although the test passes, I found a potential problem in the test which may fail it in some cases. After you load the deletion of 300 items into the source cluster after rebalance, you wait only 30 seconds before merging the buckets. If everything runs perfectly, there is no problem. But if you hit any error, the replicator will restart, and in that case the 30-second wait may be too short for the replicators to restart and catch up on replicating all items.

2013-12-19 20:52:56 | INFO | MainProcess | MainThread | [xdcrbasetests.sleep] sleep for 30 secs. ...
2013-12-19 20:53:26 | INFO | MainProcess | MainThread | [xdcrbasetests.merge_buckets] merge buckets 10.3.2.43->10.3.3.101, bidirection:False





13-12-19 20:54:47 | INFO | MainProcess | MainThread | [xdcrbasetests.verify_xdcr_stats] and Verify xdcr replication stats at Source Cluster : 10.3.2.43
2013-12-19 20:54:50 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:54:52 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.2.43:8091',default bucket
2013-12-19 20:54:53 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:54:56 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.3.104:8091',default bucket
2013-12-19 20:54:59 | INFO | MainProcess | MainThread | [xdcrbasetests.verify_xdcr_stats] Verify xdcr replication stats at Destination Cluster : 10.3.3.101
2013-12-19 20:55:00 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:55:04 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.3.101:8091',default bucket
2013-12-19 20:55:04 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:55:08 | INFO | MainProcess | Cluster_Thread | [task.check] Saw ep_queue_size 0 == 0 expected on '10.3.3.103:8091',default bucket
2013-12-19 20:55:12 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:55:15 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:55:18 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 700 == 700 expected on '10.3.2.43:8091''10.3.3.104:8091',default bucket
2013-12-19 20:55:18 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:55:22 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:55:25 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 700 == 700 expected on '10.3.2.43:8091''10.3.3.104:8091',default bucket
2013-12-19 20:55:29 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:55:32 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:55:35 | INFO | MainProcess | MainThread | [task.__init__] 1000 items will be verified on default bucket
2013-12-19 20:55:35 | INFO | MainProcess | load_gen_task | [task.has_next] 0 items were verified
2013-12-19 20:55:35 | INFO | MainProcess | load_gen_task | [data_helper.getMulti] Can not import concurrent module. Data for each server will be got sequentially
2013-12-19 20:56:19 | INFO | MainProcess | load_gen_task | [task.has_next] 1000 items were verified in 44.8155920506 sec.the average number of ops - 22.3136571342 per second
2013-12-19 20:56:21 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:56:25 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:56:27 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 700 == 700 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2013-12-19 20:56:28 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:56:31 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:56:33 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 700 == 700 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2013-12-19 20:56:35 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:56:38 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:56:40 | INFO | MainProcess | MainThread | [task.__init__] 1000 items will be verified on default bucket
2013-12-19 20:56:41 | INFO | MainProcess | load_gen_task | [task.has_next] 0 items were verified
2013-12-19 20:56:41 | INFO | MainProcess | load_gen_task | [data_helper.getMulti] Can not import concurrent module. Data for each server will be got sequentially
2013-12-19 20:57:22 | INFO | MainProcess | load_gen_task | [task.has_next] 1000 items were verified in 42.144990921 sec.the average number of ops - 23.7276109742 per second
2013-12-19 20:57:26 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:57:29 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:57:34 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:57:37 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:57:39 | INFO | MainProcess | load_gen_task | [task.has_next] Verification done, 0 items have been verified (updated items: 0)
2013-12-19 20:57:42 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.2.43:11210 default
2013-12-19 20:57:46 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.104:11210 default
2013-12-19 20:57:50 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2013-12-19 20:57:53 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2013-12-19 20:57:56 | INFO | MainProcess | load_gen_task | [task.has_next] Verification done, 0 items have been verified (deleted items: 0)
2013-12-19 20:57:56 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== XDCRbasetests stats for test #1 swap_rebalance_out_master ==============
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] Type of run: UNIDIRECTIONAL XDCR
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] STATS with source at 10.3.2.43 and destination at 10.3.3.101
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Bucket: default
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Median XDC replication ops for bucket 'default': 0.00301204819277 K ops per second
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Mean XDC replication ops for bucket 'default': 0.0118389172996 K ops per second
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== = = = = = = = = END = = = = = = = = = = ==============
2013-12-19 20:57:57 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== rebalanceXDCR cleanup was started for test #1 swap_rebalance_out_master ==============
2013-12-19 20:57:58 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [u'default'] on 10.3.2.43
2013-12-19 20:57:58 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] remove bucket default ...
2013-12-19 20:58:03 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleted bucket : default from 10.3.2.43
2013-12-19 20:58:03 | INFO | MainProcess | MainThread | [bucket_helper.wait_for_bucket_deletion] waiting for bucket deletion to complete....
2013-12-19 20:58:03 | INFO | MainProcess | MainThread | [rest_client.bucket_exists] existing buckets : []
2013-12-19 20:58:05 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
2013-12-19 20:58:05 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.104&user=Administrator&knownNodes=ns_1%4010.3.3.104%2Cns_1%4010.3.2.43
2013-12-19 20:58:06 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance operation started
2013-12-19 20:58:08 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] rebalance progress took 2.56104779243 seconds
2013-12-19 20:58:08 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] sleep for 2.56104779243 seconds after rebalance...
2013-12-19 20:58:11 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] removed all the nodes from cluster associated with ip:10.3.2.43 port:8091 ssh_username:root ? [(u'ns_1@10.3.3.104', 8091)]
2013-12-19 20:58:12 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.2.43:8091
2013-12-19 20:58:12 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.2.43:8091 is running
2013-12-19 20:58:13 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 10.3.3.104
2013-12-19 20:58:15 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.3.104:8091
2013-12-19 20:58:16 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.3.104:8091 is running
2013-12-19 20:58:17 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [u'default'] on 10.3.3.101
2013-12-19 20:58:17 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] remove bucket default ...
2013-12-19 20:58:21 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleted bucket : default from 10.3.3.101
2013-12-19 20:58:21 | INFO | MainProcess | MainThread | [bucket_helper.wait_for_bucket_deletion] waiting for bucket deletion to complete....
2013-12-19 20:58:21 | INFO | MainProcess | MainThread | [rest_client.bucket_exists] existing buckets : []
2013-12-19 20:58:23 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
2013-12-19 20:58:24 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.103&user=Administrator&knownNodes=ns_1%4010.3.3.101%2Cns_1%4010.3.3.103
2013-12-19 20:58:24 | INFO | MainProcess | MainThread | [rest_client.rebalance] rebalance operation started
2013-12-19 20:58:27 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] rebalance progress took 2.58956503868 seconds
2013-12-19 20:58:27 | INFO | MainProcess | MainThread | [rest_client.monitorRebalance] sleep for 2.58956503868 seconds after rebalance...
2013-12-19 20:58:30 | INFO | MainProcess | MainThread | [cluster_helper.cleanup_cluster] removed all the nodes from cluster associated with ip:10.3.3.101 port:8091 ssh_username:root ? [(u'ns_1@10.3.3.103', 8091)]
2013-12-19 20:58:31 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.3.101:8091
2013-12-19 20:58:31 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.3.101:8091 is running
2013-12-19 20:58:32 | INFO | MainProcess | MainThread | [bucket_helper.delete_all_buckets_or_assert] deleting existing buckets [] on 10.3.3.103
2013-12-19 20:58:34 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] waiting for ns_server @ 10.3.3.103:8091
2013-12-19 20:58:34 | INFO | MainProcess | MainThread | [cluster_helper.wait_for_ns_servers_or_assert] ns_server @ 10.3.3.103:8091 is running
2013-12-19 20:58:34 | INFO | MainProcess | MainThread | [rebalanceXDCR.tearDown] ============== rebalanceXDCR cleanup was finished for test #1 swap_rebalance_out_master ==============
ok

----------------------------------------------------------------------
Ran 1 test in 1330.685s

OK
summary so far suite xdcr.rebalanceXDCR.Rebalance , pass 1 , fail 0
testrunner logs, diags and results are available under logs/testrunner-13-Dec-19_20-36-24
Run after suite setup for xdcr.rebalanceXDCR.Rebalance.swap_rebalance_out_master
Junyis-MacBook-Pro:testrunner junyi$
Comment by Junyi Xie (Inactive) [ 20/Dec/13 ]
Based on my investigation, this is a test issue rather than a code bug. Not a blocker.
Comment by Anil Kumar [ 20/Dec/13 ]
Sangharsh - Please confirm this is a test issue as mentioned by Junyi; if not, reopen with details.
Comment by Sangharsh Agarwal [ 22/Dec/13 ]
Anil, currently around 16-18 test cases are failing (as per the latest Jenkins execution on 22nd December) because of this issue. I am reviewing each test as per Junyi's comment and will update this issue soon.
Comment by Sangharsh Agarwal [ 26/Dec/13 ]
Junyi,
   After adding your suggested waits (after creating buckets on the destination and after the del/update ops), this bug is still occurring on Jenkins and many tests are failing because of it. In the recent Jenkins execution almost 18 jobs failed because of this issue. Can you please check the code that updates the stat for outbound mutations (replication_changes_left)? It might also be a case of the stat value being updated incorrectly, because the number of items on both sides is in sync.
Comment by Sangharsh Agarwal [ 27/Dec/13 ]
[Automated test steps]

1. Create 4 node source cluster and 3 node destination cluster.
2. Setup bidirectional replication for default bucket in CAPI mode.
3. Load 10000 items on both Source side and destination side.
4. Wait for 60 seconds to ensure replication is completed.
5. Failover one non-master node at destination side.
6. Add back the node.
7. Wait for 60 seconds.
8. Perform 30% update and delete at source and destination side.
9. Wait for 120 seconds
10. Verify results. -> Failed here because outbound mutations were non-zero.
Comment by Sangharsh Agarwal [ 27/Dec/13 ]
When I increased the timeout in Step-3 to 180 seconds and in Step-7 to 120 seconds, the test passed. But this kind of fix is only temporary because the behaviour is not consistent: the required timeout depends on various factors, e.g. the data load, the replication mode (capi/xmem), the number of updates/deletions, etc.

Additionally, we need to know why replication_changes_left is not zero after completion and stays stuck at some non-zero value forever.
Also, I specifically looked for the keyword "changes_left" in the Couchbase Server logs but couldn't find any clue that would help in understanding the cause of the issue. Please suggest any specific keyword I can search for in the logs to analyze further and help find the root cause.

FYI, we did not previously have this verification (checking that all mutations are replicated) in the test code; it was added only a month ago. That is why we are facing this kind of issue for the first time.
Comment by Sangharsh Agarwal [ 29/Dec/13 ]
http://qa.sc.couchbase.com/job/ubuntu_x64--36_01--XDCR_upgrade-P1/30/consoleFull

4 upgrade tests also failed because of this issue, even though the number of items was not large.
Comment by Junyi Xie (Inactive) [ 30/Dec/13 ]
Looked at the test with Sangharsh; it hits this even without any rebalance. Several replicators crashed unexpectedly. The error code captured by XDCR is "nil", which looks like a bug and is likely related to the recent xdcr-over-ssl change from the ns_server team.

Two questions need answers from the ns_server team:

1) Why did the replicator crash with an http_request_failed error? Per Sangharsh, there is no topology change on either side and the write rate is very small (less than 1k/sec per bucket), which should not cause any stress issue.
2) Why is the error code returned from the remote side "nil"? It does not provide any insight into what happened.
 

error_logger:error,2013-12-30T9:27:12.746,ns_1@10.3.2.109:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.10675.5> terminating
** Last message in was start_replication
** When Server state == [{data,
                          [{"State",
                            {rep_state,
                             {rep,
                              <<"1a57639231dd745992cbe1736c7f1c8c/default/default">>,
                              <<"default">>,
                              <<"/remoteClusters/1a57639231dd745992cbe1736c7f1c8c/buckets/default">>,
                              "xmem",
                              [{optimistic_replication_threshold,256},
                               {worker_batch_size,500},
                               {remote_proxy_port,11214},

                               {cert,
                                <<"-----BEGIN CERTIFICATE-----\r\nMIICmDCCAYKgAwIBAgIIE0STzcUZwvAwCwYJKoZIhvcNAQEFMAwxCjAIBgNVBAMT\r\nASowHhcNMTMwMTAxMDAwMDAwWhcNNDkxMjMxMjM1OTU5WjAMMQowCAYDVQQDEwEq\r\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA4zWT9StoHrRlaQHevX7v\r\ny4l/9RW2PJDpiSjriGOPK9Vpn5lQ5KBqlBbftLIZ+M2hclXQe4fvh1tS12hU5vLB\r\n9zAKsLlF/vyELa9e\
JHuykdMhBuu55VgJCm+m+WzrKSaEmZ837Dbawv7+Bpesyk0N\r\nMX96HrNY83KlzFVl/gwKsXK5TvuoHrfQ8g4odDZEDjnW1VlcAOaISNa8NwCpSrx0\r\n1eXqFnm9cax3FPCS8rZBd8KbFvWXBSFVH/Vpn+03godir1Rn3+nJteWV9S/3Kgap\r\n2TTtFAi4crsVsdcTbezEOI6l0TNL7yjq2yDzNvVKLugX9XA6W3wh4/Nbmu/sgHyo\r\ndwIDAQABowIwADALBgkqhkiG9w0BAQUDggEBACsdjy/32B/osqbgsNvbyjlGlOOY\r\nGZY4HoPgHFZciDqPo9XZ64zHyIAnZ/\
Oy/5rdcajhmFixgIuEj0pNhLRPKbRzeXQ3\r\nG1wtW7YeK1BGrUSmSgZi9BIfLUEPmYiSYmwSnXlwNNFpKoOhcuxgZ97E6RUqdqLq\r\nwF4P7dpw5CXWudpLH9TqEuk7fxzK6ANTC9kgXEqr8+GOqAzG4VAtpEug/EeOI0Wr\r\nB0q6xT7rUvnDnPIr3MPb+aNXU2mHKSpz6nntkaJ+VHyGhlMNgjyICPzrECvC2Pol\r\nKaDxA3I5knrwMQzAspRq4VEafXQYnnjCFMBzzXaQ/P61P7GFpg3InrOqlvs=\r\n-----END CERTIFICATE-----\r\n">>}]},
                             <0.10418.5>,<0.10414.5>,<<"default/156">>,
                             <<"*****@10.3.4.175:18092/default%2f156%3bf52074c035e5ad86641a058dda01caaf">>,
                             undefined,undefined,undefined,undefined,[],
                             {[{<<"session_id">>,
                                <<"d8a212c92cc7c43e6cb7d872a7a8c5bc">>},
                               {<<"source_last_seq">>,44},
                               {<<"start_time">>,
                                <<"Mon, 30 Dec 2013 16:31:47 GMT">>},
                               {<<"end_time">>,
                                <<"Mon, 30 Dec 2013 17:10:34 GMT">>},
                               {<<"docs_checked">>,44},
                               {<<"docs_written">>,44},
                               {<<"data_replicated">>,21108},
                               {<<"history">>,
                                [{[{<<"session_id">>,
                                    <<"d8a212c92cc7c43e6cb7d872a7a8c5bc">>},
                                   {<<"start_time">>,
                                    <<"Mon, 30 Dec 2013 16:31:47 GMT">>},
                                   {<<"end_time">>,
                                    <<"Mon, 30 Dec 2013 17:10:34 GMT">>},
                                   {<<"start_last_seq">>,0},
                                   {<<"end_last_seq">>,44},
                                   {<<"recorded_seq">>,44},
                                   {<<"docs_checked">>,44},
                                   {<<"docs_written">>,44},
                                   {<<"data_replicated">>,21108}]}]}]},
                             0,44,48,0,[],48,
                             {doc,
                              <<"_local/156-1a57639231dd745992cbe1736c7f1c8c/default/default">>,
                              {1,<<27,24,13,111>>},
                              {[]},
                              0,false,[]},
                             {doc,
                              <<"_local/156-1a57639231dd745992cbe1736c7f1c8c/default/default">>,
                              {1,<<27,24,13,111>>},
                              {[]},
                              0,false,[]},
                             "Mon, 30 Dec 2013 16:31:47 GMT",
                             <<"1388400453481967">>,<<"1388400381">>,nil,
                             {1388,424429,531152},
                             {1388,423435,765637},
                             [],<0.9171.9>,
                             <<"d8a212c92cc7c43e6cb7d872a7a8c5bc">>,48,
                             false}}]}]
** Reason for termination ==
** {http_request_failed,"HEAD",
                        "https://Administrator:*****@10.3.4.175:18092/default%2f156%3bf52074c035e5ad86641a058dda01caaf/",
                        {error,{code,nil}}}


Comment by Junyi Xie (Inactive) [ 31/Dec/13 ]
Alk, can you please take a quick look at comments at 30/Dec/13 12:55 PM?
Comment by Junyi Xie (Inactive) [ 31/Dec/13 ]
Please assign back to me after you put your thoughts. Thanks.
Comment by Aleksey Kondratenko [ 31/Dec/13 ]
1) I cannot comment on anything without a clear pointer to logs.

2) I believe that conflating SSL tests and non-SSL tests is a big mistake. Open a new bug if this is related to SSL.
Comment by Junyi Xie (Inactive) [ 31/Dec/13 ]
Sangharsh,

My suggestion:

1) Please provide the logs Alk asked for (if you did not keep the logs, you need to reproduce the test where you saw SSL errors but without any topology change).
2) As Alk suggested, please file a separate bug for the SSL issue.


Thanks.



Comment by Sangharsh Agarwal [ 01/Jan/14 ]
Junyi,
   This issue was occurring even before the SSL feature was merged. I have already provided reproduction logs 3 times. This issue is failing many XDCR test cases (around 10-12) continuously in every execution. Please use the above-mentioned logs for analysis, and let me know if anything is missing from these logs. For XDCR tests we keep the restart interval at 1 second for each test.
Comment by Maria McDuff (Inactive) [ 08/Jan/14 ]
Junyi,

In build 1028, 2 tests are failing because outbound mutations do not zero out:

1).xdcr.biXDCR.bidirectional.xdcr.biXDCR.bidirectional.load_with_async_ops_and_joint_sets,doc-ops:create,GROUP:P0;xmem,demand_encryption:1,items:10000,case_number:1,conf_file:py-xdcr-bidirectional.conf,num_nodes:6,cluster_name:6-win-xdcr,ctopology:chain,rdirection:bidirection,ini:/tmp/6-win-xdcr.ini,doc-ops-dest:create,replication_type:xmem,get-cbcollect-info:True,spec:py-xdcr-bidirectional
2). xdcr.biXDCR.bidirectional.xdcr.biXDCR.bidirectional.load_with_async_ops_and_joint_sets_with_warmup,doc-ops:create-update,GROUP:P0;xmem,demand_encryption:1,items:10000,upd:30,case_number:6,conf_file:py-xdcr-bidirectional.conf,num_nodes:6,cluster_name:6-win-xdcr,ctopology:chain,rdirection:bidirection,ini:/tmp/6-win-xdcr.ini,doc-ops-dest:create-update,replication_type:xmem,get-cbcollect-info:True,spec:py-xdcr-bidirectional

Comment by Junyi Xie (Inactive) [ 08/Jan/14 ]
Maria, please upload or point me to the logs of these two failed tests. Without logs, I cannot say anything.
Comment by Maria McDuff (Inactive) [ 08/Jan/14 ]
I'm collecting the logs... they are uploading now.
Comment by Maria McDuff (Inactive) [ 08/Jan/14 ]
Junyi,

here are the logs: https://s3.amazonaws.com/bugdb/jira/MB-9707/mb9707.tgz
Comment by Junyi Xie (Inactive) [ 09/Jan/14 ]
Duplicate of MB-9745. Let us fix MB-9745 first.
Comment by Maria McDuff (Inactive) [ 10/Jan/14 ]
MB-9745.
Comment by Maria McDuff (Inactive) [ 10/Jan/14 ]
Should be re-tested when MB-9745 is fixed.
Comment by Sangharsh Agarwal [ 17/Jan/14 ]
4 tests failed because of this issue:

http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/37/consoleFull

[Logs for tests]
./testrunner -i /tmp/ubuntu-64-2.0-biXDCR-sanity.ini items=10000,get-cbcollect-info=True -t xdcr.biXDCR.bidirectional.load_with_failover,replicas=1,items=10000,ctopology=chain,rdirection=bidirection,doc-ops=create-update-delete,doc-ops-dest=create-update,failover=destination,replication_type=xmem,GROUP=P0;xmem

[Test Steps]
1. FailureRestartInterval is 1 for this test.
2. Create SRC cluster with 3 nodes, and Destination cluster with 4 nodes.
3. Setup Bi-directional xmem non-encryption replication.
4. Load 10000 - 10000 items on both Source and Destination cluster.
5. Perform failover on destination side for non-master node.
6. Sleep for 30 seconds
7. Perform asynchronous updates and deletes (30%) on source side and updates on destination side.
8. Verification steps:
    i) Wait for curr_items, vb_active_curr_items on the Source side to be 17000 and ep_queue_size = 0.
    ii) Wait for replication_changes_left == 0 on destination side. -> Failed here.
    



[Improved logging on the test]
[2014-01-17 00:40:33,434] - [xdcrbasetests:652] INFO - Waiting for Outbound mutation to be zero on cluster node: 10.1.3.96
[2014-01-17 00:40:33,699] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:40:33,701] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:40:43,850] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:40:43,851] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:40:54,143] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:40:54,144] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:04,344] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:04,346] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:14,625] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:14,628] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:24,889] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:24,890] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:35,228] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:35,229] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:45,570] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:45,572] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:41:55,796] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:41:55,803] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:06,026] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:06,028] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:16,202] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:16,203] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:26,499] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:26,501] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:36,695] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:36,696] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:46,869] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:46,870] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:42:57,104] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:42:57,105] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:43:07,478] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:43:07,479] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:43:17,743] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:43:17,749] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...
[2014-01-17 00:43:27,977] - [xdcrbasetests:661] INFO - Current outbound mutations on cluster node: 10.1.3.96 for bucket default is 10
[2014-01-17 00:43:27,978] - [xdcrbasetests:334] INFO - sleep for 10 secs. ...


[Logs]

Source ->

10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-9707/c7048a1d/10.1.3.93-1172014-044-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-9707/5788f308/10.1.3.94-1172014-045-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-9707/550388dd/10.1.3.95-1172014-046-diag.zip

Destination ->

10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-9707/75f497c4/10.1.3.96-1172014-047-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-9707/5742d86b/10.1.3.97-1172014-048-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-9707/a280fb85/10.1.3.99-1172014-049-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-9707/1e3c0b00/10.1.2.12-1172014-049-diag.zip


Check the logs after [2014-01-17 00:31:00,028]. They contain the logs for this test case only.

[user:info,2014-01-17T0:31:36.530,ns_1@10.1.3.93:<0.28274.18>:menelaus_web_remote_clusters:do_handle_remote_clusters_post:96]Created remote cluster reference "cluster1" via 10.1.3.96:8091.
[user:info,2014-01-17T0:31:36.632,ns_1@10.1.3.93:<0.28905.18>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.
[error_logger:info,2014-01-17T0:31:36.638,ns_1@10.1.3.93:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
Comment by Junyi Xie (Inactive) [ 17/Jan/14 ]
I repeated the test using my own VMs (2:3 configuration) and the test passes. All items are verified correctly. See part of the test output below.

From Sangharsh's recent logs, all items have been replicated and synced up on both sides. The non-zero "outbound XDCR mutations" is likely a stats issue, which I suspect is due to XDCR receiving an incorrect vb map containing "dead" vbuckets that no longer belong to the node. This happens during topology change, which is why the non-zero "outbound XDCR mutations" is only seen on the "Destination cluster" (which is also a source, since this is a bi-directional replication).

I will continue investigation.


A couple of side comments:

1. We still see a db_not_found error when XDCR starts up in the test, though XDCR is able to recover.
2. The verification stage seems to take very long (> 12 min to verify 20K items) on one cluster.




2014-01-17 12:44:38 | INFO | MainProcess | load_gen_task | [task.has_next] 10000 items were verified
2014-01-17 12:54:55 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified
2014-01-17 12:54:55 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified in 656.043297052 sec.the average number of ops - 30.4857927225 per second

2014-01-17 12:55:05 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2014-01-17 12:55:29 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2014-01-17 12:55:37 | INFO | MainProcess | Cluster_Thread | [task.check] Saw curr_items 17000 == 17000 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2014-01-17 12:55:40 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2014-01-17 12:56:06 | INFO | MainProcess | Cluster_Thread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2014-01-17 12:56:25 | INFO | MainProcess | Cluster_Thread | [task.check] Saw vb_active_curr_items 17000 == 17000 expected on '10.3.3.101:8091''10.3.3.103:8091',default bucket
2014-01-17 12:56:47 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.101:11210 default
2014-01-17 12:57:12 | INFO | MainProcess | MainThread | [data_helper.direct_client] creating direct client 10.3.3.103:11210 default
2014-01-17 12:57:30 | INFO | MainProcess | MainThread | [task.__init__] 20000 items will be verified on default bucket
2014-01-17 12:57:30 | INFO | MainProcess | load_gen_task | [task.has_next] 0 items were verified
2014-01-17 12:59:15 | INFO | MainProcess | load_gen_task | [task.has_next] 10000 items were verified
2014-01-17 13:07:57 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified
2014-01-17 13:07:57 | INFO | MainProcess | load_gen_task | [task.has_next] 20000 items were verified in 627.347528934 sec.the average number of ops - 31.880256233 per second

2014-01-17 13:08:39 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests stats for test #1 load_with_failover ==============
2014-01-17 13:08:41 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] Type of run: BIDIRECTIONAL XDCR
2014-01-17 13:08:41 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] STATS with source at 10.3.2.47 and destination at 10.3.3.101
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Bucket: default
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average local replica creation rate for bucket 'default': 8.00986759593 KB per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Median XDC replication ops for bucket 'default': 0.005 K ops per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Mean XDC replication ops for bucket 'default': 0.0107190453205 K ops per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average XDCR data replication rate for bucket 'default': 7.94547104654 KB per second
2014-01-17 13:08:42 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] STATS with source at 10.3.3.101 and destination at 10.3.2.47
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Bucket: default
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average local replica creation rate for bucket 'default': 6.83460734451 KB per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Median XDC replication ops for bucket 'default': 0.003 K ops per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Mean XDC replication ops for bucket 'default': 0.010893736388 K ops per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests._print_stats] Average XDCR data replication rate for bucket 'default': 6.81691981277 KB per second
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== = = = = = = = = END = = = = = = = = = = ==============
2014-01-17 13:08:43 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests cleanup was started for test #1 load_with_failover ==============

2014-01-17 13:09:29 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests cleanup was finished for test #1 load_with_failover ==============
ok

----------------------------------------------------------------------
Ran 1 test in 3757.717s

OK
summary so far suite xdcr.biXDCR.bidirectional , pass 1 , fail 0
testrunner logs, diags and results are available under logs/testrunner-14-Jan-17_12-06-51
Run after suite setup for xdcr.biXDCR.bidirectional.load_with_failover
-bash: xmem: command not found

Comment by Junyi Xie (Inactive) [ 17/Jan/14 ]
Sangharsh,

This is a stats issue. Here is the action plan from the scrub meeting.

1. Junyi will create a toybuild with the fix and ask Sangharsh to rerun the test to verify it.

2. If debugging takes longer than expected then, because the stat is buggy right now, Sangharsh can temporarily remove the check on this stat from the verification code to let the test continue. The verification code in the test will still verify that all data on both sides are consistent. This just unblocks the test; Sangharsh can add the stat check back once the stat is fixed.
Comment by Wayne Siu [ 17/Jan/14 ]
Junyi, please update the ticket when the toybuild is available. Thanks.
Comment by Sangharsh Agarwal [ 19/Jan/14 ]
Junyi,
   I agree with the points from the scrub meeting. Please provide the toy build so that I can verify the fix.

Also, is it possible to add a debug log statement in the server code that prints the value of remaining outbound mutations in the ns_server or xdcr logs?
Comment by Junyi Xie (Inactive) [ 19/Jan/14 ]
Sangharsh,

The toybuild is here:

http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_cent58-2.5.0-toy-junyi-x86_64_2.5.0-MB9707A-toy.rpm

I reran the test using this build on my own VMs (2:3 configuration). The test passes (it took quite long to finish though).


Junyis-MacBook-Pro:testrunner junyi$ ./testrunner -i ~/memo/vm/xmem2.ini items=10000,get-cbcollect-info=True -t xdcr.biXDCR.bidirectional.load_with_failover,replicas=1,items=10000,ctopology=chain,rdirection=bidirection,doc-ops=create-update-delete,doc-ops-dest=create-update,failover=destination,replication_type=xmem,GROUP=P0;


2014-01-19 23:09:31 | INFO | MainProcess | MainThread | [xdcrbasetests.tearDown] ============== XDCRbasetests cleanup was finished for test #1 load_with_failover ==============
ok

----------------------------------------------------------------------
Ran 1 test in 4251.595s

OK
summary so far suite xdcr.biXDCR.bidirectional , pass 1 , fail 0
testrunner logs, diags and results are available under logs/testrunner-14-Jan-19_21-58-39
Run after suite setup for xdcr.biXDCR.bidirectional.load_with_failover
Junyis-MacBook-Pro:testrunner junyi$


Comment by Sangharsh Agarwal [ 19/Jan/14 ]
Junyi,
   I will run the whole test suite with this toy build and share the results with you.
Comment by Sangharsh Agarwal [ 20/Jan/14 ]
Junyi,
I have run the whole test suite with this toy build on my VMs. The issue did not occur. Can you please merge this fix?

Please briefly describe the issue you found and the fix provided in this toybuild.
Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]

- This issue is highly likely to happen on Jenkins, but neither QE nor dev can reproduce it on their own VMs after trying multiple times. This makes diagnosing the root cause very hard.

- At this time it looks like a stats issue rather than a data replication issue.

- The root cause is not 100% clear. It looks like some upstream information (the updated vb map during topology change) consumed by the XDCR stats collection is not fully correct. For example, the new vb map received by XDCR from ns_server is not fully updated and still contains dead vbuckets. As a result, the XDCR stats code aggregates the dead vbuckets and the issue can happen.

- It is also possible this is caused by a race condition.

- Junyi's fix in the toybuild adds a defensive check so that the stats code aggregates only active vbuckets (see the sketch below).
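The actual fix is in the Erlang XDCR stats code; purely to illustrate the idea, here is a hypothetical Python sketch of aggregating only the vbuckets that are currently active on the node:

# Illustration only (the real fix lives in the Erlang stats code): drop "dead"
# vbuckets, i.e. ones that migrated away during rebalance, before summing the
# per-vbucket changes_left into the node-level "outbound mutations" stat.
def aggregate_changes_left(per_vb_changes_left, active_vbuckets):
    """per_vb_changes_left: dict {vb_id: changes_left};
    active_vbuckets: set of vbucket ids active on this node per the latest vb map."""
    return sum(count for vb, count in per_vb_changes_left.items()
               if vb in active_vbuckets)

# Without the filter, a stale entry for a migrated vbucket keeps contributing a
# non-zero count forever, which matches the symptom seen in these tests.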


Per discussion with Sangharsh this morning, action items:

Sangharsh:
1) will modify the test so that the stat check cannot crash the test; instead, a failed stat check should be logged and the test should continue with the other verifications, e.g. verifying data items on both sides;
2) will rerun a set of XDCR tests using the toybuild on his own VMs (per Sangharsh, Jenkins cannot run a toybuild at this time). Because it is quite hard for Sangharsh to reproduce this on his own VMs, testing the toybuild once is probably not enough to verify that the fix works.


Junyi:
push the fix in toybuild to gerrit and start review.

Comment by Sangharsh Agarwal [ 20/Jan/14 ]
I have modified the test and created review http://review.couchbase.org/#/c/32662/
Comment by Andrei Baranouski [ 20/Jan/14 ]
I can't approve the changes:

1) this is done in a general method.
2) This is tantamount to removing this verification for all tests.
3) if replication_changes_left is buggy, we spend a lot of time on meaningless testing of it (how often is it reproduced?)

@Sangharsh, why can't Jenkins run a toybuild at this time? If that's the case we should fix it.
As a workaround, you can install toy builds manually and run only the tests on Jenkins.

My opinion is that the test should fail. Another thing: we could fail the test with a corresponding message after all the other checks. This can be done like this:

__wait_for_mutation_to_replicate returns a boolean (False if a timeout occurs)
verify_xdcr_stats gets this value and, based on all the results, determines the final status of the test
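A minimal sketch of that flow (the method names follow the existing xdcrbasetests helpers; the bodies and the verify_data wrapper are illustrative stand-ins, not the real testrunner code):

# Sketch only: record the stat-check outcome, run the remaining verifications,
# and decide pass/fail once at the end.
class XDCRVerification(object):
    def verify_xdcr_stats(self, src_nodes, dest_nodes):
        failures = []

        # stat check returns a boolean instead of raising on timeout
        if not self._wait_for_mutation_to_replicate(dest_nodes[0]):
            failures.append("replication_changes_left did not reach 0 (timeout)")

        # always continue with the doc-level checks (keys, values, revids)
        if not self.verify_data(src_nodes, dest_nodes):
            failures.append("document verification failed")

        # the final status is determined only after every verification has run
        if failures:
            raise AssertionError("XDCR verification failed: %s" % failures)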






Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]
The commit in the toybuild is now pending review from ns_server team

http://review.couchbase.org/#/c/32663/
Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]
Hi Andrei,

If the stat is not zero, the test should fail, but it should not crash in the middle when the stat check times out. The correct behavior is that the test finishes all verifications and, at the end, given all the verification results, determines whether the test should pass or fail.

Comment by Junyi Xie (Inactive) [ 20/Jan/14 ]
Per bug scrub meeting, we will convert it to a doc bug. Here is the description.

Maria,

Let me know if I miss anything.



"Users may see incorrect stat "Outbound mutations" after topology change at source side. If all XDCR activity has settled down and data have been replicated, "Outbound mutations" stat should see 0, meaning no remaining mutations to be replicated. Due to race condition, "Outbound mutations" may contain stats from "dead vbuckets" that were active before rebalance but have been migrated to other nodes during rebalance. If users hit this issue, "Outbound mutations" may show non-zero stat even after all data are replicated. User can verify the data on both sides by checking number of items in source and destination bucket on both sides.

Stop/restart XDCR should refresh all stats and if all data have been replicated, at incoming XDCR stats at destination side, no set and delete operations will be seen, metadata operations will be seen though."



Comment by Maria McDuff (Inactive) [ 20/Jan/14 ]
Cloned doc bug: MB-9960
Comment by Andrei Baranouski [ 21/Jan/14 ]
Hi Junyi,

"If the stat is not zero, the test should fail but it should not crash in the middle when this stats checking time out. The correct behavior is that test should finish all verifications, and at the end of test given all verification results it determines if the test should pass or fail."

Completely agree; this is what I meant.

Sangharsh, let's implement this approach for such cases. let me know if I can do anything to help
Comment by Sangharsh Agarwal [ 21/Jan/14 ]
Andrei,
  I have uploaded the updated changes. Please review.

http://review.couchbase.org/#/c/32662/
Comment by Sangharsh Agarwal [ 21/Jan/14 ]
I have started the Jenkins job on the toy build now with the changes in the test code:

http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/39/console

Comment by Sangharsh Agarwal [ 22/Jan/14 ]
Junyi,
   Results of execution http://qa.sc.couchbase.com/job/centos_x64--01_01--uniXDCR_biXDCR-P0/39/consoleFull with the toy build:

Results are unpredictable:

Failed tests: 17
Passed: 3

All the failed tests failed with the wrong number of items on the destination server, and the tests timed out.
Comment by Sangharsh Agarwal [ 22/Jan/14 ]
In the attached snapshot, the right-hand side shows that replication is configured, but on the left side there is no "Outgoing replication" tab for the bucket.
Comment by Sangharsh Agarwal [ 22/Jan/14 ]
There are two issues I have observed:

1. The replication status was "Starting up" on 10.1.3.93 -> 10.1.3.96 for a long time and no replication was taking place. Please find the logs below.

[SRC]
10.1.3.93 : https://s3.amazonaws.com/bugdb/jira/MB-9707/332ea7a9/10.1.3.93-1212014-2144-diag.zip
10.1.3.94 : https://s3.amazonaws.com/bugdb/jira/MB-9707/55d57001/10.1.3.94-1212014-2146-diag.zip
10.1.3.95 : https://s3.amazonaws.com/bugdb/jira/MB-9707/c0f467b5/10.1.3.95-1212014-2145-diag.zip

[DEST]
10.1.3.96 : https://s3.amazonaws.com/bugdb/jira/MB-9707/c0844e93/10.1.3.96-1212014-2148-diag.zip
10.1.3.97 : https://s3.amazonaws.com/bugdb/jira/MB-9707/02eca96e/10.1.3.97-1212014-2147-diag.zip
10.1.3.99 : https://s3.amazonaws.com/bugdb/jira/MB-9707/e43d699f/10.1.3.99-1212014-2150-diag.zip
10.1.2.12 : https://s3.amazonaws.com/bugdb/jira/MB-9707/eac5f09d/10.1.2.12-1212014-2151-diag.zip


A replication was created from 10.1.3.93 -> 10.1.3.96 at 21:32:30.432 but it did not actually replicate anything:

[user:info,2014-01-21T21:32:30.432,ns_1@10.1.3.93:<0.2181.0>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.

[xdcr:error,2014-01-21T21:32:31.105,ns_1@10.1.3.93:<0.5811.0>:xdc_vbucket_rep:terminate:377]Shutting xdcr vb replicator ({init_state,
                              {rep,
                               <<"a85361a0f85b08e631930ad342bb1171/default/default">>,
                               <<"default">>,
                               <<"/remoteClusters/a85361a0f85b08e631930ad342bb1171/buckets/default">>,
                               "xmem",
                               [{max_concurrent_reps,32},
                                {checkpoint_interval,1800},
                                {doc_batch_size_kb,2048},
                                {failure_restart_interval,1},
                                {worker_batch_size,500},
                                {connection_timeout,180},
                                {worker_processes,4},
                                {http_connections,20},
                                {retries_per_request,2},
                                {optimistic_replication_threshold,256},
                                {xmem_worker,1},
                                {enable_pipeline_ops,true},
                                {local_conflict_resolution,false},
                                {socket_options,
                                 [{keepalive,true},{nodelay,false}]},
                                {supervisor_max_r,25},
                                {supervisor_max_t,5},
                                {trace_dump_invprob,1000}]},
                              62,"xmem",<0.5411.0>,<0.5412.0>,<0.5408.0>}) down without ever successfully initializing: shutdown

Also, ns_server.xdcr.log on 10.1.3.93 has no entries between 21:32:31.105 and 21:41:08, which means XDCR was not running at that time.

Then I deleted and re-created the replication from 10.1.3.93 -> 10.1.3.96 and it started replicating:

[user:info,2014-01-21T21:41:17.165,ns_1@10.1.3.93:<0.2852.2>:menelaus_web_xdc_replications:handle_create_replication:50]Replication from bucket "default" to bucket "default" on cluster "cluster1" created.


Please check whether this is caused by the changes in the toy build.
Comment by Junyi Xie (Inactive) [ 22/Jan/14 ]
That is because of a bunch of db_not_found errors around 21:32:30. The timing in the test does not work on Jenkins.

Both the toybuild and the regular build work well in standalone tests; you also tried several times. I do not understand why it always fails on Jenkins.



[error_logger:error,2014-01-21T21:32:30.708,ns_1@10.1.3.93:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: xdc_vbucket_rep:init/1
    pid: <0.5431.0>
    registered_name: []
    exception exit: {db_not_found,<<"http://Administrator:*****@10.1.2.12:8092/default%2f38%3b2de080ec0a0409811c3560b4779092f1/">>}
      in function gen_server:terminate/6
    ancestors: [<0.5413.0>,<0.5408.0>,xdc_replication_sup,ns_server_sup,
                  ns_server_cluster_sup,<0.59.0>]
    messages: []
    links: [<0.5413.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 16332
  neighbours:
Comment by Dipti Borkar [ 22/Jan/14 ]
Junyi,


The information you added above does not help a user or support verify that this is not a data problem but a stats problem.
Please add information here about how to verify that the # of documents match or that there is no data left to replicate.
Comment by Junyi Xie (Inactive) [ 23/Jan/14 ]
Dipti,

I do not fully understand what you really need in order to verify the data. Users can always look at the item counts on both sides to check whether they match (see the sketch below). To further determine whether any mutation was not replicated, users can

1) read the data from both sides and compare them, like the verification code in the test does
2) as I said, restart XDCR; no sets will be seen if the data are already replicated, which is probably the easiest way.

At this time I am not aware of other solutions.
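As a rough example of the item-count check from 1) above (a coarse check only, since it cannot detect missed updates to existing keys; it assumes basicStats.itemCount is exposed by the bucket details REST endpoint):

import requests

def item_count(node, bucket="default",
               user="Administrator", password="password"):
    # Assumes basicStats.itemCount is available in the bucket details response.
    url = "http://%s:8091/pools/default/buckets/%s" % (node, bucket)
    return requests.get(url, auth=(user, password)).json()["basicStats"]["itemCount"]

# Node addresses below are taken from the test output earlier in this ticket.
src, dest = item_count("10.3.2.47"), item_count("10.3.3.101")
print("source=%d destination=%d match=%s" % (src, dest, src == dest))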
Comment by Aleksey Kondratenko [ 24/Jan/14 ]
I verified the item-count stats in the logs just in case. The item counts match, so it does look like a stats problem (unless the test is doing some mutations, in which case simply comparing item counts is not the right way to see whether data is indeed replicated).

I believe QE should add a verification pass where they use tap/upr or even _all_docs to actually get all keys alive on the source and destination clusters. Then they can (and should) actually GET all those keys and compare values. Even better would be to also compare metadata (seqnos, cas, expirations, etc.); see the sketch below.

Otherwise we'll keep having these useless conversations about whether we're really sure there is or is not data loss.
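To make that concrete, a hypothetical sketch of the per-key comparison; src_client and dest_client stand in for whatever binary-protocol client the test framework uses, and the get() call shown is an assumption, not a specific library API:

# Hypothetical sketch: compare each key's value (and ideally metadata) between
# the source and destination clusters. src_client/dest_client are placeholders
# for the test framework's memcached-protocol client.
def compare_docs(src_client, dest_client, keys):
    mismatches = []
    for key in keys:
        src_val = src_client.get(key)
        dest_val = dest_client.get(key)
        if src_val != dest_val:
            mismatches.append((key, "value"))
        # for a stronger check, also compare revids/seqnos/cas/expirations via
        # a get-meta style call, if the client exposes one
    return mismatches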
Comment by Dipti Borkar [ 24/Jan/14 ]
Completely agree, which is why I had asked Junyi to explain the best way to verify that no data loss has occurred: what ep_engine stats or other stats can QE look at to validate that there is no data loss?

Alk, so you suggest _all_docs?

Someone from QE, please work with Alk and Junyi on how to verify this.
Comment by Aleksey Kondratenko [ 24/Jan/14 ]
>> Alk, so you suggest _all_docs?

Unfortunately I cannot really suggest _all_docs. We're supposed to kill it in a few weeks, and the replacement will not support streaming all of a bucket's documents/keys.

_all_docs is probably the easiest to consume today. But looking forward, it looks like we'll have to use tap or upr for that. Or even couch_dbdump.
Comment by Sangharsh Agarwal [ 27/Jan/14 ]
Alk,
Current Verification process:

1. Verify stats on the Source and Destination clusters:
        -> ep_queue_size == 0
        -> curr_items == number of items on the cluster
        -> vb_active_curr_items == number of items on the cluster
        -> replication_changes_left == 0
2. Verify data (key, value) on both the Source and Destination clusters.

Does _all_docs differ from the above checks?
Comment by Andrei Baranouski [ 27/Jan/14 ]
+ we also _verify_revIds in our tests
Comment by Aleksey Kondratenko [ 27/Jan/14 ]
Thanks, Andrei and Sangharsh.

But how come we're _debating_ whether there's data loss at all if your tests already do all the required verification? Perhaps in all XDCR-related tickets you should _clearly_ state whether the test detected actual data loss or not, for extra clarity.
Comment by Andrei Baranouski [ 27/Jan/14 ]
no data loss, the problem is in outbound mutations stats
Comment by Aleksey Kondratenko [ 27/Jan/14 ]
Great. "In Andrei I trust" :)
Comment by Cihan Biyikoglu [ 06/Feb/14 ]
based on the last comment - if there isn't data loss, should we still consider this a blocker?
thanks
-cihan
Comment by Maria McDuff (Inactive) [ 10/Mar/14 ]
Raising as a Blocker as it is failing most of the QE tests.
Comment by Aleksey Kondratenko [ 10/Mar/14 ]
Cannot agree with Blocker. QE tests are _supposed_ to check every doc. Therefore QE tests can handle bad stats.

If issue is not due to bad stats, then it's a _different issue_.
Comment by Aleksey Kondratenko [ 10/Mar/14 ]
If it's blocking some tests, it appears to be a problem with the tests, as I've mentioned above.

If it's a new bug, then it requires a new ticket.

In any case it requires more coordination.
Comment by Aruna Piravi [ 12/Mar/14 ]
>Cannot agree with Blocker. QE tests are _supposed_ to check every doc. Therefore QE tests can handle bad stats.

QE tests do check every doc. However, _any_ verification starts only when we know replication is complete, and it is this stat, "replication_changes_left", that we rely on heavily to know whether replication has come to a stop. So it is not right to say QE tests can handle bad stats. When this stat doesn't become 0 after replication, our tests time out.

Even in testing pause and resume, we rely on this stat to check whether active_vbreps/incoming XDCR ops on the remote cluster go up after replication is resumed. If stats are buggy, our tests can fail for no good reason.

We are open to coordination, but it's not a new bug; tests have been failing for the last few months because of this stat, and hence this is a blocker from a QE perspective.
Comment by Aleksey Kondratenko [ 12/Mar/14 ]
Here's a suggestion (I was assuming it's obvious): stop testing stats. Only test actual docs.

Do I understand correctly that this will unblock your month-long blocked testing?
Comment by Sangharsh Agarwal [ 12/Mar/14 ]
We don't test this stat here; we check it to ensure there are no outbound mutations left so that we can proceed to further validation of doc keys, values, revids, etc. Though in 3.0 we are planning to test some of the stats for the pause and resume feature (after pause and resume).
Comment by Aruna Piravi [ 12/Mar/14 ]
It would be meaningful to start testing docs only after replication is complete, right? Would it make sense otherwise? And the test has been blocked for more than 3-4 months now.
Comment by Andrei Baranouski [ 12/Mar/14 ]
My suggestion is to not fail the test for now (and I think that is what Alk meant) if mutations are still being replicated after a long time:
https://github.com/couchbase/testrunner/blob/master/pytests/xdcr/xdcrbasetests.py#L733

Another option: we could implement the ability to identify the bug, i.e. when 'outbound mutations' is unchanged and non-zero.

According to the comments, I see it happening mostly with failover/warmup scenarios, and the Pause and Resume feature should work as expected in basic cases.
Comment by Andrei Baranouski [ 14/May/14 ]
Sangharsh, could you check the XDCR test logs for some jobs/runs to see whether we still hit the TimeoutError on Outbound mutations in 3.0?
Comment by Sangharsh Agarwal [ 14/May/14 ]
Tests are not stable yet on 3.0, I will update it once stable.
Comment by Sangharsh Agarwal [ 28/May/14 ]
I am still seeing this issue on 3.0.
Comment by Andrei Baranouski [ 28/May/14 ]
I see, could you specify the build version where you still see it?

@Alk, I think we still ignore checking this stat? (outbound replication mutations != 0)
Comment by Aleksey Kondratenko [ 28/May/14 ]
Andrei, can you elaborate on your question?
Comment by Andrei Baranouski [ 29/May/14 ]
CBQE-2280 - ticket to create separate tests for stats verification

Sangharsh, could you provide complete information on the current problem: build, test, steps, logs, collect_info, and then assign it to Alk.

Comment by Sangharsh Agarwal [ 29/May/14 ]
Andrei,

>Sangharsh, could you provide complete information on the current problem: build, test, steps, logs, collect_info and then assign it on Alk

I think the original issue is still not fixed. Do you think XDCR UPR will fix this issue?
Comment by Andrei Baranouski [ 29/May/14 ]
Sangharsh, do you expect that devs will study the logs on the old version?
Comment by Sangharsh Agarwal [ 29/May/14 ]
No, I really don't want that. If you look at the history of this bug, it has been reproduced 5-6 times and logs were posted every time. If there is any improvement in the logs/product that can help in analyzing this issue, then I think it is advisable to upload new logs.

Anyways:

Build: 721 (Upgrade tests)

XDCR UPR is used after upgrade.


[Jenkins]
http://qa.hq.northscale.net/job/centos_x64--104_01--XDCR_upgrade-P1/5/consoleFull

[Test]
./testrunner -i centos_x64--104_01--XDCR_upgrade-P1.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,get-coredumps=True,upgrade_version=3.0.0-721-rel -t xdcr.upgradeXDCR.UpgradeTests.online_cluster_upgrade,initial_version=2.0.0-1976-rel,sdata=False,bucket_topology=default:1>2;standard_bucket0:1<2;bucket0:1><2,post-upgrade-actions=src-rebalancein;dest-rebalanceout;dest-create_index


[Number of tests that have this issue] (search for the string "Timeout occurs while waiting for mutations to be replicated" on the above link)
6

[Test Steps]
1. Set up a 2.0 Source cluster and a 2.0 Destination cluster with 2 nodes each.
2. XDCR, capi mode:
     bucket0 <-> bucket0 (Load 1000 items on each side)
     default -> default (Load 1000 items on Source)
     standard_bucket0 <-- standard_bucket0 (Load 1000 items on destination)
3. Upgrade nodes to 3.0.0-721-rel.
4. Perform mutations on each cluster (updates and deletes).
5. Rebalance in and rebalance out one node at the Source and Destination clusters respectively.
6. Verify stats on both sides.
 

[Test Logs]
[2014-05-23 11:09:04,639] - [task:1054] INFO - 3000 items were verified in 2.99595117569 sec.the average number of ops - 1001.3507945 per second
[2014-05-23 11:09:04,639] - [xdcrbasetests:1332] INFO - Waiting for Outbound mutation to be zero on cluster node: 10.3.3.126
[2014-05-23 11:09:04,762] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:04,863] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:04,864] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:14,972] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:15,067] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:15,068] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:25,181] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:25,281] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:25,282] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:35,399] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:35,500] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:35,501] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:45,614] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:45,720] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:45,721] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:09:55,835] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:09:55,934] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:09:55,935] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:06,047] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:06,146] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:06,147] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:16,263] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:16,363] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:16,364] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:26,476] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:26,577] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:26,578] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:36,692] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:36,795] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:36,796] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:46,916] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:47,023] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:47,024] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:10:57,134] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:10:57,235] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:10:57,236] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:07,314] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:07,414] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:07,416] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:17,523] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:17,628] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:17,639] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:27,752] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:27,853] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:27,854] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:37,969] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:38,075] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:38,076] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:48,188] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:48,289] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:48,290] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:11:58,403] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket bucket0 is 1
[2014-05-23 11:11:58,506] - [xdcrbasetests:1341] INFO - Current outbound mutations on cluster node: 10.3.3.126 for bucket default is 0
[2014-05-23 11:11:58,507] - [basetestcase:255] INFO - sleep for 10 secs. ...
[2014-05-23 11:12:08,518] - [xdcrbasetests:1351] ERROR - Timeout occurs while waiting for mutations to be replicated
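
The excerpt shows the harness polling the per-bucket outbound-mutation stat every 10 seconds until it reaches zero or the timeout fires. A minimal sketch of that wait loop, where get_outbound_mutations is a hypothetical stand-in for the xdcrbasetests stat lookup:

import time

def wait_for_zero_outbound_mutations(get_outbound_mutations, bucket,
                                     timeout_secs=180, poll_secs=10):
    # get_outbound_mutations(bucket) -> int is a hypothetical callable standing
    # in for the stat lookup seen in the log excerpt above.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        pending = get_outbound_mutations(bucket)
        print("Current outbound mutations for bucket %s is %s" % (bucket, pending))
        if pending == 0:
            return
        time.sleep(poll_secs)
    raise AssertionError("Timeout occurs while waiting for mutations to be replicated")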


[Logs]

[Source]
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-9707/3dc2d4a2/10.3.3.126-5232014-2223-diag.zip
10.3.3.126 : https://s3.amazonaws.com/bugdb/jira/MB-9707/ce380435/10.3.3.126-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-9707/b122888f/10.3.5.61-diag.txt.gz
10.3.5.61 : https://s3.amazonaws.com/bugdb/jira/MB-9707/f2dd5236/10.3.5.61-5232014-2226-diag.zip


[Destination]
10.3.121.199 : https://s3.amazonaws.com/bugdb/jira/MB-9707/f4d0554b/10.3.121.199-5232014-2231-diag.zip
10.3.121.199 : https://s3.amazonaws.com/bugdb/jira/MB-9707/f989626e/10.3.121.199-diag.txt.gz
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-9707/86246c34/10.3.5.11-diag.txt.gz
10.3.5.11 : https://s3.amazonaws.com/bugdb/jira/MB-9707/fe85afac/10.3.5.11-5232014-2229-diag.zip


[Node added on Source]
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-9707/19ef5b23/10.3.5.60-diag.txt.gz
10.3.5.60 : https://s3.amazonaws.com/bugdb/jira/MB-9707/b708d7af/10.3.5.60-5232014-2233-diag.zip

Comment by Andrei Baranouski [ 29/May/14 ]
At the moment this is all that is required to move forward with this bug.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
MB-7432




[MB-11570] XDCR checkpointing: Increment num_failedckpts stat when checkpointing fails with 404 error Created: 26/Jun/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centOS

Triage: Untriaged
Is this a Regression?: No

 Description   
Scenario
------------
Do a failover at the destination.
The next immediate checkpoint on a failed-over vbucket will fail with error 404.
However, you will not notice any change in the "last 10 failed checkpoints per node" stat in the GUI. This is not the case with error code 400.


Can we increment this stat on 404 errors as well? Please let me know if you need logs.
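
For illustration only, the requested behaviour amounts to counting a 404 the same way a 400 is already counted when a checkpoint request fails. A tiny sketch with made-up names (this is not the actual ns_server/XDCR code):

# Hypothetical accounting: treat 404 (vbucket gone after failover) like 400.
FAILED_CHECKPOINT_STATUSES = {400, 404}

def record_checkpoint_result(stats, http_status):
    # stats is a plain dict standing in for the per-replication stat table
    if http_status in FAILED_CHECKPOINT_STATUSES:
        stats["num_failedckpts"] = stats.get("num_failedckpts", 0) + 1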

 Comments   
Comment by Aleksey Kondratenko [ 26/Jun/14 ]
Yes I need logs.
Comment by Aruna Piravi [ 26/Jun/14 ]
Same set of logs as in MB-11571.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Aleksey Kondratenko [ 18/Jul/14 ]
This is actually because this code can only track the last 10 checkpoints per node. It's not something new, and I'm not sure it's worth the effort to fix.
Comment by Aruna Piravi [ 25/Jul/14 ]
Yes, it tracks only the last 10 checkpoints per node, so I'm not going to push for a fix. It might help with some unit testing in the future. Also, I filed this MB after consulting you, so I will leave it to you.
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Actually, I've already fixed it in my work tree as part of the stats work that I'll submit under MB-7432.




[MB-11797] Rebalance-out hangs during Rebalance + Views operation in DGM run Created: 23/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, ns_server, view-engine
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Meenakshi Goel Assignee: Aleksey Kondratenko
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-973-rel

Attachments: Text File logs.txt    
Triage: Triaged
Operating System: Ubuntu 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Link:
http://qa.sc.couchbase.com/job/ubuntu_x64--65_02--view_query_extended-P1/145/consoleFull

Test to Reproduce:
./testrunner -i /tmp/ubuntu12-view6node.ini get-delays=True,get-cbcollect-info=True -t view.createdeleteview.CreateDeleteViewTests.incremental_rebalance_out_with_ddoc_ops,ddoc_ops=create,test_with_view=True,num_ddocs=2,num_views_per_ddoc=3,items=200000,active_resident_threshold=10,dgm_run=True,eviction_policy=fullEviction

Steps to Reproduce:
1. Set up a 5-node cluster
2. Create default bucket
3. Load 200000 items
4. Load bucket to achieve dgm 10%
5. Create Views
6. Start ddoc + Rebalance out operations in parallel

Please refer to the attached log file "logs.txt".

Uploading Logs:


 Comments   
Comment by Meenakshi Goel [ 23/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/8586d8eb/172.23.106.201-7222014-2350-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/ea5d5a3f/172.23.106.199-7222014-2354-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d06d7861/172.23.106.200-7222014-2355-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/65653f65/172.23.106.198-7222014-2353-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/dd05a054/172.23.106.197-7222014-2352-diag.zip
Comment by Sriram Melkote [ 23/Jul/14 ]
Nimish - to my eyes, it looks like views are not involved in this failure. Can you please take a look at the detailed log and assign to Alk if you agree? Thanks
Comment by Nimish Gupta [ 23/Jul/14 ]
From the logs:

[couchdb:info,2014-07-22T14:47:21.345,ns_1@172.23.106.199:<0.17993.2>:couch_log:info:39]Set view `default`, replica (prod) group `_design/dev_ddoc40`, signature `c018b62ae9eab43522a3d0c43ac48b3e`, terminating with reason: {upr_died,
                                                                                                                                       {bad_return_value,
                                                                                                                                        {stop,
                                                                                                                                         sasl_auth_failed}}}

One obvious problem is that we returned the wrong number of parameters for stop when SASL auth failed. I have fixed that, and it is under review (http://review.couchbase.org/#/c/39735/).

I don't know why SASL auth failed; it may be normal for SASL auth to fail during rebalance. Meenakshi, could you please run the test again after this change is merged?
Comment by Nimish Gupta [ 23/Jul/14 ]
Trond has added code to log more information for SASL errors in memcached (http://review.couchbase.org/#/c/39738/). It will be helpful for debugging SASL errors.
Comment by Meenakshi Goel [ 24/Jul/14 ]
The issue is reproducible with the latest build, 3.0.0-1020-rel.
http://qa.sc.couchbase.com/job/ubuntu_x64--65_03--view_dgm_tests-P1/99/consoleFull
Uploading Logs shortly.
Comment by Meenakshi Goel [ 24/Jul/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-11797/13f68e9c/172.23.106.186-7242014-1238-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/c0cf8496/172.23.106.187-7242014-1239-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/77b2fb50/172.23.106.188-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/d0335545/172.23.106.189-7242014-1240-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-11797/7634b520/172.23.106.190-7242014-1241-diag.zip
Comment by Nimish Gupta [ 24/Jul/14 ]
From the ns_server logs, it looks to me like memcached has crashed.

[error_logger:error,2014-07-24T12:28:36.305,ns_1@172.23.106.186:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_memcached:init/1
    pid: <0.693.0>
    registered_name: []
    exception exit: {badmatch,{error,closed}}
      in function gen_server:init_it/6 (gen_server.erl, line 328)
    ancestors: ['single_bucket_sup-default',<0.675.0>]
    messages: []
    links: [<0.717.0>,<0.719.0>,<0.720.0>,<0.277.0>,<0.676.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 75113
    stack_size: 27
    reductions: 26397931
  neighbours:

Ep-engine/ns_server team, please take a look.
Comment by Nimish Gupta [ 24/Jul/14 ]
From the logs:

** Reason for termination ==
** {unexpected_exit,
       {'EXIT',<0.31044.9>,
           {{{badmatch,{error,closed}},
             {gen_server,call,
                 ['ns_memcached-default',
                  {get_dcp_docs_estimate,321,
                      "replication:ns_1@172.23.106.187->ns_1@172.23.106.188:default"},
                  180000]}},
            {gen_server,call,
                [{'janitor_agent-default','ns_1@172.23.106.187'},
                 {if_rebalance,<0.15733.9>,
                     {wait_dcp_data_move,['ns_1@172.23.106.188'],321}},
                 infinity]}}}}
Comment by Sriram Melkote [ 25/Jul/14 ]
Alk, can you please take a look? Thanks!
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Quick hint for fellow coworkers: when you see a connection closed, usually the first thing to check is whether memcached has crashed. And in this case it indeed has (the diag's cluster-wide logs are the perfect place to find these issues):

2014-07-24 12:28:35.861 ns_log:0:info:message(ns_1@172.23.106.186) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:09:47.941525 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 650) stream created with start seqno 5794 and end seqno 18446744073709551615
Thu Jul 24 12:09:49.115570 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 749, cookie 0x606f800
Thu Jul 24 12:09:49.380310 PDT 3: (default) Notified the completion of checkpoint persistence for vbucket 648, cookie 0x6070d00
Thu Jul 24 12:09:49.450869 PDT 3: (default) UPR (Consumer) eq_uprq:replication:ns_1@172.23.106.189->ns_1@172.23.106.186:default - (vb 648) Attempting to add takeover stream with start seqno 5463, end seqno 18446744073709551615, vbucket uuid 35529072769610, snap start seqno 5463, and snap end seqno 5463
Thu Jul 24 12:09:49.495674 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.186->ns_1@172.23.106.187:default - (vb 648) stream created with start seqno 5463 and end seqno 18446744073709551615
2014-07-24 12:28:36.302 ns_memcached:0:info:message(ns_1@172.23.106.186) - Control connection to memcached on 'ns_1@172.23.106.186' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_memcached:0:info:message(ns_1@172.23.106.187) - Control connection to memcached on 'ns_1@172.23.106.187' disconnected: {badmatch,
                                                                        {error,
                                                                         closed}}
2014-07-24 12:28:36.756 ns_log:0:info:message(ns_1@172.23.106.187) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting. Messages: Thu Jul 24 12:28:35.860224 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1019) Stream closing, 0 items sent from disk, 0 items sent from memory, 5781 was last seqno sent
Thu Jul 24 12:28:35.860235 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1020) Stream closing, 0 items sent from disk, 0 items sent from memory, 5879 was last seqno sent
Thu Jul 24 12:28:35.860246 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1021) Stream closing, 0 items sent from disk, 0 items sent from memory, 5772 was last seqno sent
Thu Jul 24 12:28:35.860256 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1022) Stream closing, 0 items sent from disk, 0 items sent from memory, 5427 was last seqno sent
Thu Jul 24 12:28:35.860266 PDT 3: (default) UPR (Producer) eq_uprq:replication:ns_1@172.23.106.187->ns_1@172.23.106.186:default - (vb 1023) Stream closing, 0 items sent from disk, 0 items sent from memory, 5480 was last seqno sent

Status 137 is 128 (death by signal, set by the kernel) + 9, so signal 9 (SIGKILL). dmesg (captured in couchbase.log) shows no signs of the OOM killer. This means humans :) Not the first and sadly not the last time something like this happens: rogue scripts, bad tests, etc.
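
A quick way to sanity-check that arithmetic from a Python prompt (137 is the status quoted in the log line above):

import signal

status = 137                              # exit status quoted in the log line above
assert status > 128                       # statuses above 128 mean "killed by signal (status - 128)"
signum = status - 128
print(signum, signal.Signals(signum).name)    # prints: 9 SIGKILL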
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Also, we should stop the practice of reusing tickets for unrelated conditions. This doesn't look anywhere close to a rebalance hang, does it?
Comment by Aleksey Kondratenko [ 25/Jul/14 ]
Not sure what to do about this one. Closing as incomplete will probably not hurt.




[MB-10376] XDCR Pause and Resume : Pausing during rebalance-in does not flush XDCR queue on all nodes Created: 05/Mar/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 10.3.4.186-352014-1846-diag.zip     Zip Archive 10.3.4.187-352014-1849-diag.zip     Zip Archive 10.3.4.188-352014-1851-diag.zip     PNG File Screen Shot 2014-03-05 at 6.37.50 PM.png     PNG File Screen Shot 2014-03-05 at 6.38.12 PM.png    
Triage: Triaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
Build
-------
3.0.0-400

Scenario
--------------
- 2 one-node clusters, create uni-xdcr (checkpoint interval = 60 secs)
- Add in one node on source, rebalance
- Pause replication while rebalance-in is in progress
- Observe outbound xdcr stats for the bucket
- rebalance-in is stuck (MB-10385)

While active_vbreps immediately drops to 0, the XDCR queue is not flushed even after 20 minutes (which I think is a bug in itself).
By XDCR queue I mean docs_rep_queue and size_rep_queue.
In the screenshot you will observe that, on the source nodes:
docs_rep_queue on 10.3.4.186 is 199
docs_rep_queue on 10.3.4.187 is 0.

Note:
10.3.4.186 is the master, .187 is the new node.
active_vbreps drops to zero on both nodes anyway, causing replication to stop. I'm not sure whether the unflushed queue and the stuck rebalance are related; just sharing my observation.
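
For anyone reproducing this, one way to watch the queue outside the UI is to poll the standard bucket stats REST endpoint and read the last replication-queue sample. The sample key below ("replication_docs_rep_queue") is assumed from the stat names used by the test harness elsewhere in this report; a minimal sketch:

import json, time, urllib.request

def docs_rep_queue(host, bucket="default", user="Administrator", password="password"):
    # /pools/default/buckets/<bucket>/stats returns op.samples with per-stat arrays;
    # "replication_docs_rep_queue" is assumed to be the outbound XDCR queue sample key.
    url = "http://%s:8091/pools/default/buckets/%s/stats" % (host, bucket)
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    samples = json.load(opener.open(url))["op"]["samples"]
    return samples["replication_docs_rep_queue"][-1]

while docs_rep_queue("10.3.4.186") > 0:   # the node whose queue never drains here
    time.sleep(10)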

Attached
--------------
Screenshot and cbcollect from .186, .187 (source) and .188 (target)


 Comments   
Comment by Aleksey Kondratenko [ 06/Mar/14 ]
Is there any reason to believe that xdcr affects this case at all?
Comment by Aruna Piravi [ 06/Mar/14 ]
Well, when I started writing this bug report, it was basically to draw attention to the fact that the XDCR queue was not getting flushed on one of the two nodes. Later I noticed that rebalance was stuck on the same node, which was a more serious issue. I was not sure whether they are related.

To check that,
I resumed replication- rebalance was still stuck.
I deleted replication - rebalance was still stuck
I stopped and started rebalance again - stuck at 0%

If rebalancing in one node on a one-node cluster fails, it must either be a regression or something related to the XDCR operations I was performing in parallel (creating, deleting, pausing and resuming replications). I'm also not sure if rebalance uses UPR yet. Maybe I should open two separate bugs for the unflushed XDCR queue and the stuck rebalance?

Comment by Aleksey Kondratenko [ 06/Mar/14 ]
Rebalance being stuck is likely a duplicate of something filed elsewhere.

The rest of the bug description looks like a stats bug.
Comment by Aleksey Kondratenko [ 06/Mar/14 ]
Rebalance is not using UPR yet, but there are already some regressions, AFAIK.
Comment by Aruna Piravi [ 06/Mar/14 ]
I see a bunch of rebalance-stuck issues which ep-engine says are a result of the TAP/UPR refactoring, some related to memory leaks in TAP and checkpoints waiting to be persisted (already closed). I'm not sure what is causing it here, so it would be good if ns_server or ep-engine takes a look at the logs. Filing a separate bug - MB-10385. Please feel free to close it as a duplicate if the logs give a reason to believe so.

We can use this issue to track the unflushed XDCR queue / stats problem.
Comment by Anil Kumar [ 19/Jun/14 ]
Triage - June 19 2014 Alk, Parag, Anil
Comment by Aruna Piravi [ 25/Jul/14 ]
Finding this fixed. Closing this issue.




[MB-11616] Rebalance not available 'pending add rebalance' while loading data Created: 02/Jul/14  Updated: 25/Jul/14

Status: In Progress
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Major
Reporter: Anil Kumar Assignee: Pavel Blagodov
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-07-02 at 11.23.16 AM.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
4-node cluster scenario:

1. Load data into a single node (only one node in the cluster at this time)
2. Data loading starts (progress is shown)
3. Add new nodes to the cluster (3 more nodes added)
4. Rebalance is not available until data loading is complete

Screenshot attached

Expected -

On the Pending Rebalance tab, a warning message: "Rebalance is not available until data loading is completed"

 Comments   
Comment by Aleksey Kondratenko [ 02/Jul/14 ]
You forgot to mention that it's sample data loading.

We explicitly prevent rebalance in this case because the samples loader is unable to deal with rebalance. There was a bug fixed for this as part of the 3.0 work.
Comment by Anil Kumar [ 07/Jul/14 ]
Got it. Please check the Expected section; we need some message shown to the user.
Comment by Aleksey Kondratenko [ 07/Jul/14 ]
Please adapt the UI to show this message if a data loading task is present.
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Anil, Wayne, Parag, Tony .. July 17th
Comment by Pavel Blagodov [ 18/Jul/14 ]
http://review.couchbase.org/39533




[MB-10680] XDCR Pause/Resume: Resume(during rebalance-in) causes replication status of existing replications of target cluster(which has failed over node) to go to "starting up" mode Created: 27/Mar/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0, 3.0-Beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Aruna Piravi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4GB RAM, 4 core VMs

Attachments: Zip Archive 172.23.106.209-3272014-186-diag.zip     Zip Archive 172.23.106.45-3272014-180-diag.zip     Zip Archive 172.23.106.46-3272014-182-diag.zip     Zip Archive 172.23.106.47-3272014-183-diag.zip     Zip Archive 172.23.106.48-3272014-185-diag.zip     PNG File Screen Shot 2014-03-27 at 5.28.24 PM.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
Scenario
--------------
1. Create two clusters with 2 nodes each and create bi-xdcr on 2 buckets. Load data, watch replication. Pause all replications at C1. C2 continues to replicate to C1.
2. Rebalance-in one node at cluster1 while failing over one node and rebalancing it out at cluster2. Resume all replications at C1.
3. Notice that on cluster2, all ongoing replications go from "replicating" to "starting up" mode and there's no outbound replication category for any of the cluster buckets.

Setup
--------
[Cluster1]
172.23.106.45
172.23.106.46 <--- 172.23.106.209 [rebalance-in]

[Cluster2]
172.23.106.47
172.23.106.48 ---> failover and rebalance-out

Reproducible?
---------------------
Yes, consistently, tried thrice.

Attached
--------------
cbcollect info and screenshot

Script
--------
./testrunner -i /tmp/bixdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,items=30000,rdirection=bidirection,ctopology=chain,sasl_buckets=1,rebalance_in=source,rebalance_out=destination,failover=destination,pause=source



Will scale down to one replication and also try with xmem.

 Comments   
Comment by Aruna Piravi [ 28/Mar/14 ]
Not seen with XMEM and just 1 bi-xdcr.
Comment by Aruna Piravi [ 28/Mar/14 ]
However seen with XMEM and 2 bi-xdcrs. Also there's no xdcr activity between the clusters after the replication status changes to "starting up" on one cluster.

Mem usage was between 20-30% (unusual for 2 bi-xdcrs but justified due to lack of xdcr activity)

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11013 couchbas 20 0 864m 303m 6408 S 45.4 7.8 68:08.01 memcached
10960 couchbas 20 0 2392m 1.0g 39m S 13.3 27.4 178:20.21 beam.smp

Also not seen with CAPI and just 1 bi-xdcr. 2 replications on a 4GB RAM VM could be a reason. Let me try on bigger VMs and get back.
Comment by Aruna Piravi [ 28/Mar/14 ]
Reproduced on VMs with 15GB RAM. Related to Pause and Resume: resuming replications at a cluster (which is also rebalancing in a node) kills XDCR at the remote cluster (which is rebalancing out).
Comment by Aruna Piravi [ 23/Apr/14 ]
Any update on this bug?
Comment by Anil Kumar [ 17/Jul/14 ]
Triage - Alk, Aruna, Anil, Wayne .. July 17th
Comment by Aruna Piravi [ 25/Jul/14 ]
Finding this fixed in the latest builds. Closing this MB. Thanks.




[MB-11819] XDCR: Rebalance at destination hangs, missing replica items Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket, cross-datacenter-replication
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Test Blocker
Reporter: Aruna Piravi Assignee: Mike Wiederhold
Resolution: Duplicate Votes: 0
Labels: rebalance-hang
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 172.23.106.45-7242014-208-diag.zip     Zip Archive 172.23.106.46-7242014-2010-diag.zip     Zip Archive 172.23.106.47-7242014-2011-diag.zip     Zip Archive 172.23.106.48-7242014-2013-diag.zip    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Build
-------
3.0.0-1014

Scenario
------------
1. Uni-xdcr between 2-node clusters, default bucket
2. Load 30K items on source
3. Pause XDCR
4. Start "rebalance-out" of one node each from both clusters simultaneously.
5. Resume xdcr

Rebalance at the source proceeds to completion; rebalance at the destination hangs at 10%. See:

[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at cluster 172.23.106.47
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:07,325] - [rest_client:1216] INFO - rebalance percentage : 100 %
[2014-07-24 13:28:30,222] - [task:411] INFO - rebalancing was completed with progress: 100% in 83.475001812 sec
[2014-07-24 13:28:30,223] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:28:30,229] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:40,252] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:28:50,280] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:00,301] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:10,342] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:20,363] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:30,389] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:40,410] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:29:50,437] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:00,458] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:10,480] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:20,504] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:30,523] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:40,546] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:30:50,569] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
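
The destination's figure above never moves past 10.0911458333%. A minimal stall check in the spirit of that polling loop, where get_progress is a hypothetical stand-in for the rest_client progress call:

import time

def wait_for_rebalance(get_progress, timeout_secs=600, poll_secs=10, stall_polls=18):
    # get_progress() -> float is a hypothetical callable returning the rebalance
    # percentage, standing in for the rest_client lookup shown in the log above.
    last, unchanged = None, 0
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        progress = get_progress()
        if progress >= 100:
            return
        unchanged = unchanged + 1 if progress == last else 0
        if unchanged >= stall_polls:
            raise AssertionError("rebalance appears stuck at %s%%" % progress)
        last = progress
        time.sleep(poll_secs)
    raise AssertionError("rebalance did not complete within %s seconds" % timeout_secs)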

Testcase
--------------
./testrunner -i uni-xdcr.ini -t xdcr.pauseResumeXDCR.PauseResumeTest.replication_with_pause_and_resume,items=30000,rdirection=unidirection,ctopology=chain,replication_type=xmem,rebalance_out=source-destination,pause=source,GROUP=P1


Could the rebalance hang explain the missing replica items?

[2014-07-24 13:31:49,079] - [task:463] INFO - Saw curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,103] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,343] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:49,536] - [task:463] INFO - Saw vb_active_curr_items 30000 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:31:49,559] - [data_helper:289] INFO - creating direct client 172.23.106.47:11210 default
[2014-07-24 13:31:49,811] - [data_helper:289] INFO - creating direct client 172.23.106.48:11210 default
[2014-07-24 13:31:50,001] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:31:55,045] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:00,080] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:32:05,113] - [task:459] WARNING - Not Ready: vb_replica_curr_items 27700 == 30000 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

Logs
-------------
Will attach cbcollect with XDCR trace logging.

 Comments   
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Do you have _any reason at all_ to believe that it's even remotely related to xdcr? Specifically, xdcr does nothing with UPR replicas.
Comment by Aruna Piravi [ 24/Jul/14 ]
I of course _do_ know that replicas have nothing to do with XDCR. But I'm unsure whether XDCR and the parallel rebalance contributed to the hang.
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
I cannot diagnose a stuck rebalance when logs are captured after cleanup.
Comment by Aruna Piravi [ 24/Jul/14 ]
And more on why I think so:

Please note from the logs below that there has been no progress in the rebalance at the destination _from_ the time we resumed XDCR. Until then it had progressed to 10%.

[2014-07-24 13:26:59,500] - [pauseResumeXDCR:92] INFO - ##### Pausing xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:26:59,541] - [rest_client:1757] INFO - Updated pauseRequested=true on bucket'default' on 172.23.106.45
[2014-07-24 13:26:59,968] - [task:517] WARNING - Not Ready: xdc_ops 1734 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket
[2014-07-24 13:27:00,145] - [task:521] INFO - Saw replication_docs_rep_queue 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:00,339] - [task:517] WARNING - Not Ready: replication_active_vbreps 16 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091', default bucket
[2014-07-24 13:27:05,490] - [task:521] INFO - Saw xdc_ops 0 == 0 expected on '172.23.106.47:8091''172.23.106.48:8091',default bucket
[2014-07-24 13:27:05,697] - [task:521] INFO - Saw replication_active_vbreps 0 == 0 expected on '172.23.106.45:8091''172.23.106.46:8091',default bucket
[2014-07-24 13:27:05,728] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.46'] at source cluster 172.23.106.45
[2014-07-24 13:27:05,760] - [xdcrbasetests:642] INFO - Starting rebalance-out nodes:['172.23.106.48'] at source cluster 172.23.106.47
[2014-07-24 13:27:05,761] - [xdcrbasetests:372] INFO - sleep for 5 secs. ...
[2014-07-24 13:27:06,733] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 13:27:06,746] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,773] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 13:27:06,796] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 13:27:06,806] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:06,816] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 13:27:10,823] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 13:27:10,860] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 13:27:11,101] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 894
[2014-07-24 13:27:11,102] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:12,043] - [task:521] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 13:27:12,260] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 869
[2014-07-24 13:27:12,261] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 13:27:13,142] - [task:521] INFO - Saw xdc_ops 4770 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 13:27:13,183] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 13:27:17,174] - [rest_client:1216] INFO - rebalance percentage : 24.21875 %
[2014-07-24 13:27:17,181] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:27,201] - [rest_client:1216] INFO - rebalance percentage : 33.59375 %
[2014-07-24 13:27:27,207] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:37,233] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 13:27:37,242] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:47,263] - [rest_client:1216] INFO - rebalance percentage : 53.90625 %
[2014-07-24 13:27:47,272] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
[2014-07-24 13:27:57,294] - [rest_client:1216] INFO - rebalance percentage : 60.8723958333 %
[2014-07-24 13:27:57,304] - [rest_client:1216] INFO - rebalance percentage : 10.0911458333 %
Comment by Aruna Piravi [ 24/Jul/14 ]
Live cluster

http://172.23.106.45:8091/
http://172.23.106.47:8091/ <-- rebalance stuck
Comment by Aruna Piravi [ 24/Jul/14 ]
New logs attached.
Comment by Aruna Piravi [ 24/Jul/14 ]
Didn't try pausing replication from the source cluster. Wanted to leave the cluster in the same state.

.47 started receiving data through the resumed XDCR from 20:04:01. The last recorded rebalance progress on .47 was 8.7890625% at 20:04:05; it could have stopped a few seconds before that.

[2014-07-24 20:03:55,538] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.46&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
[2014-07-24 20:03:55,547] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,569] - [rest_client:1095] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.48&user=Administrator&knownNodes=ns_1%40172.23.106.47%2Cns_1%40172.23.106.48
[2014-07-24 20:03:55,578] - [rest_client:1099] INFO - rebalance operation started
[2014-07-24 20:03:55,584] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:55,592] - [rest_client:1216] INFO - rebalance percentage : 0 %
[2014-07-24 20:03:59,629] - [pauseResumeXDCR:111] INFO - ##### Resume xdcr on node:172.23.106.45, src_bucket:default and dest_bucket:default #####
[2014-07-24 20:03:59,665] - [rest_client:1757] INFO - Updated pauseRequested=false on bucket'default' on 172.23.106.45
[2014-07-24 20:03:59,799] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1010
[2014-07-24 20:03:59,800] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:00,803] - [task:523] INFO - Saw replication_active_vbreps 0 >= 0 expected on '172.23.106.45:8091',default bucket
[2014-07-24 20:04:01,019] - [pauseResumeXDCR:215] INFO - Outbound mutations on 172.23.106.45 is 1082
[2014-07-24 20:04:01,020] - [pauseResumeXDCR:216] INFO - Node 172.23.106.45 is replicating
[2014-07-24 20:04:01,877] - [task:523] INFO - Saw xdc_ops 4981 >= 0 expected on '172.23.106.47:8091',default bucket
[2014-07-24 20:04:01,888] - [pauseResumeXDCR:331] INFO - Waiting for rebalance to complete...
[2014-07-24 20:04:05,894] - [rest_client:1216] INFO - rebalance percentage : 10.7421875 %
[2014-07-24 20:04:05,905] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:15,927] - [rest_client:1216] INFO - rebalance percentage : 19.53125 %
[2014-07-24 20:04:15,937] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:25,956] - [rest_client:1216] INFO - rebalance percentage : 26.7578125 %
[2014-07-24 20:04:25,964] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:35,995] - [rest_client:1216] INFO - rebalance percentage : 41.9921875 %
[2014-07-24 20:04:36,007] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:46,030] - [rest_client:1216] INFO - rebalance percentage : 50.9114583333 %
[2014-07-24 20:04:46,037] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:04:56,060] - [rest_client:1216] INFO - rebalance percentage : 59.7005208333 %
[2014-07-24 20:04:56,068] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
[2014-07-24 20:05:06,087] - [rest_client:1216] INFO - rebalance percentage : 99.9348958333 %
[2014-07-24 20:05:06,096] - [rest_client:1216] INFO - rebalance percentage : 8.7890625 %
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Same symptoms as MB-11809:

     {<0.4446.17>,
      [{registered_name,[]},
       {status,waiting},
       {initial_call,{proc_lib,init_p,3}},
       {backtrace,[<<"Program counter: 0x00007fdb6c22ffa0 (gen:do_call/4 + 392)">>,
                   <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                   <<>>,
                   <<"0x00007fdb1022d3a8 Return addr 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<"y(0) #Ref<0.0.12.179156>">>,<<"y(1) infinity">>,
                   <<"y(2) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(3) '$gen_call'">>,<<"y(4) <0.147.17>">>,
                   <<"y(5) []">>,<<>>,
                   <<"0x00007fdb1022d3e0 Return addr 0x00007fdb1b1ed020 (janitor_agent:'-spawn_rebalance_subprocess/3-fun-0-'/3 + 200)">>,
                   <<"y(0) infinity">>,
                   <<"y(1) {dcp_takeover,'ns_1@172.23.106.48',955}">>,
                   <<"y(2) 'replication_manager-default'">>,
                   <<"y(3) Catch 0x00007fdb689ced78 (gen_server:call/3 + 128)">>,
                   <<>>,
                   <<"0x00007fdb1022d408 Return addr 0x00007fdb6c2338a0 (proc_lib:init_p/3 + 688)">>,
                   <<"y(0) <0.160.17>">>,<<>>,
                   <<"0x00007fdb1022d418 Return addr 0x0000000000871ff8 (<terminate process normally>)">>,
                   <<"y(0) []">>,
                   <<"y(1) Catch 0x00007fdb6c2338c0 (proc_lib:init_p/3 + 720)">>,
                   <<"y(2) []">>,<<>>]},
       {error_handler,error_handler},
       {garbage_collection,[{min_bin_vheap_size,46422},
                            {min_heap_size,233},
                            {fullsweep_after,512},
                            {minor_gcs,0}]},
       {heap_size,233},
       {total_heap_size,233},
       {links,[<0.160.17>,<0.186.17>]},
       {memory,2816},
       {message_queue_len,0},
       {reductions,29},
       {trap_exit,false}]}
Comment by Aleksey Kondratenko [ 24/Jul/14 ]
Aruna, consider pausing XDCR. It is likely unrelated to XDCR given the MB referenced above.
Comment by Aruna Piravi [ 25/Jul/14 ]
I paused XDCR last night. No progress on the rebalance yet. Does that rule out XDCR completely?
Comment by Aruna Piravi [ 25/Jul/14 ]
Raising as a test blocker. ~10 tests failed due to this rebalance hang problem. Feel free to close if found to be a duplicate of MB-11809.
Comment by Mike Wiederhold [ 25/Jul/14 ]
Duplicate of MB-11809




[MB-11559] Memcached segfault right after initial cluster setup (master builds) Created: 26/Jun/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: bug-backlog
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Pavel Paulau Assignee: Dave Rigby
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: couchbase-server-enterprise_centos6_x86_64_0.0.0-1564-rel.rpm

Attachments: Zip Archive 000-1564.zip     Text File gdb.log    
Issue Links:
Duplicate
is duplicated by MB-11562 memcached crash with segmentation fau... Resolved
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Comments   
Comment by Dave Rigby [ 28/Jun/14 ]
This is caused by some of the changes added (on the 3.0.1 branch) by MB-11067. Fix incoming (probably Monday).
Comment by Dave Rigby [ 30/Jun/14 ]
http://review.couchbase.org/#/c/38968/

Note: depends on refactor of stats code: http://review.couchbase.org/#/c/38967




[MB-11811] [Tools] Change UPR to DCP for tools Created: 24/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Major
Reporter: Bin Cui Assignee: Bin Cui
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Comments   
Comment by Bin Cui [ 24/Jul/14 ]
http://review.couchbase.org/#/c/39814/




[MB-11785] mcd aborted in bucket_engine_release_cookie: "es != ((void *)0)" Created: 22/Jul/14  Updated: 25/Jul/14  Resolved: 25/Jul/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Tommie McAfee
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 64 vb cluster_run -n1

Attachments: Zip Archive collectinfo-2014-07-22T192534-n_0@127.0.0.1.zip    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Observed while running pyupr unit tests against the latest from the rel-3.0.0 branch.

After about 20 tests, the crash occurred on test_failover_log_n_producers_n_vbuckets. This test passes standalone, so I think it's a matter of running all the tests in succession and then hitting this issue.

backtrace:

Thread 228 (Thread 0x7fed2e7fc700 (LWP 695)):
#0 0x00007fed8b608f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed8b60c388 in __GI_abort () at abort.c:89
#2 0x00007fed8b601e36 in __assert_fail_base (fmt=0x7fed8b753718 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fed8949f28c "es != ((void *)0)",
    file=file@entry=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=line@entry=3301,
    function=function@entry=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:92
#3 0x00007fed8b601ee2 in __GI___assert_fail (assertion=0x7fed8949f28c "es != ((void *)0)",
    file=0x7fed8949ea60 "/couchbase/memcached/engines/bucket_engine/bucket_engine.c", line=3301,
    function=0x7fed8949f6e0 <__PRETTY_FUNCTION__.10066> "bucket_engine_release_cookie") at assert.c:101
#4 0x00007fed8949d13d in bucket_engine_release_cookie (cookie=0x5b422e0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:3301
#5 0x00007fed8835343f in EventuallyPersistentEngine::releaseCookie (this=0x7fed4808f5d0, cookie=0x5b422e0)
    at /couchbase/ep-engine/src/ep_engine.cc:1883
#6 0x00007fed8838d730 in ConnHandler::releaseReference (this=0x7fed7c0544e0, force=false)
    at /couchbase/ep-engine/src/tapconnection.cc:306
#7 0x00007fed883a4de6 in UprConnMap::shutdownAllConnections (this=0x7fed4806e4e0)
    at /couchbase/ep-engine/src/tapconnmap.cc:1004
#8 0x00007fed88353e0a in EventuallyPersistentEngine::destroy (this=0x7fed4808f5d0, force=true)
    at /couchbase/ep-engine/src/ep_engine.cc:2034
#9 0x00007fed8834dc05 in EvpDestroy (handle=0x7fed4808f5d0, force=true) at /couchbase/ep-engine/src/ep_engine.cc:142
#10 0x00007fed89498a54 in engine_shutdown_thread (arg=0x7fed48080540)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1564
#11 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480a5b60) at /couchbase/platform/src/cb_pthreads.c:19
#12 0x00007fed8beba182 in start_thread (arg=0x7fed2e7fc700) at pthread_create.c:312
#13 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 226 (Thread 0x7fed71790700 (LWP 693)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093e80, mutex=0x7fed78093e48, ms=720)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78093e40, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78093e40, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78093e40, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78093e40, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=122 'z')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801d610) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480203e0) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71790700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 225 (Thread 0x7fed71f91700 (LWP 692)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78093830, mutex=0x7fed780937f8, ms=86390052)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780937f0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780937f0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780937f0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780937f0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=46 '.')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed4801a6c0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801d490) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed71f91700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 224 (Thread 0x7fed72792700 (LWP 691)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3894)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=173 '\255')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480178a0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4801a670) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed72792700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 223 (Thread 0x7fed70f8f700 (LWP 690)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed78092bc0, mutex=0x7fed78092b88, ms=3893)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed78092b80, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed78092b80, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed78092b80, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed78092b80, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=147 '\223')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48014a80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed48017850) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed70f8f700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 222 (Thread 0x7fed7078e700 (LWP 689)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1672)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=61 '=')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed48011c80) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b8e90) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed7078e700) at pthread_create.c:312
---Type <return> to continue, or q <return> to quit---
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 221 (Thread 0x7fed0effd700 (LWP 688)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fed8cf43d97 in cb_cond_timedwait (cond=0x7fed780931e0, mutex=0x7fed780931a8, ms=1673)
    at /couchbase/platform/src/cb_pthreads.c:156
#2 0x00007fed8837840a in SyncObject::wait (this=0x7fed780931a0, tv=...) at /couchbase/ep-engine/src/syncobject.h:74
#3 0x00007fed883ae36f in TaskQueue::_doSleep (this=0x7fed780931a0, t=...) at /couchbase/ep-engine/src/taskqueue.cc:76
#4 0x00007fed883ae434 in TaskQueue::_fetchNextTask (this=0x7fed780931a0, t=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:97
#5 0x00007fed883ae817 in TaskQueue::fetchNextTask (this=0x7fed780931a0, thread=..., toSleep=true)
    at /couchbase/ep-engine/src/taskqueue.cc:145
#6 0x00007fed883755b2 in ExecutorPool::_nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:181
#7 0x00007fed88375655 in ExecutorPool::nextTask (this=0x7fed78089560, t=..., tick=50 '2')
    at /couchbase/ep-engine/src/executorpool.cc:196
#8 0x00007fed883889de in ExecutorThread::run (this=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:77
#9 0x00007fed88388591 in launch_executor_thread (arg=0x7fed480b67e0) at /couchbase/ep-engine/src/executorthread.cc:33
#10 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480b6890) at /couchbase/platform/src/cb_pthreads.c:19
#11 0x00007fed8beba182 in start_thread (arg=0x7fed0effd700) at pthread_create.c:312
#12 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
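
The executor workers above (e.g. Threads 222 and 221) are all parked in the same path: ExecutorThread::run -> ExecutorPool::nextTask -> TaskQueue::fetchNextTask(toSleep=true) -> TaskQueue::_doSleep -> SyncObject::wait -> cb_cond_timedwait with a roughly 1.7 s timeout, i.e. the normal idle path of a worker waiting for work on an empty task queue. A minimal sketch of that wait pattern follows; the names (TaskQueueSketch, fetchNextTask's signature) are illustrative only, not the real ep-engine types:

    // Simplified illustration only -- not the actual ep-engine implementation.
    // An idle worker blocks on a condition variable with a timeout (the frames
    // above show cb_cond_timedwait being called with ms ~= 1672) and wakes
    // either when a task is queued or when the timeout expires.
    #include <chrono>
    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>

    struct TaskQueueSketch {
        std::mutex lock;
        std::condition_variable nonEmpty;
        std::deque<std::function<void()>> tasks;

        // Returns false if no task arrived before the timeout; the caller
        // simply loops and waits again, which is the state the dump shows.
        bool fetchNextTask(std::function<void()> &task,
                           std::chrono::milliseconds timeout) {
            std::unique_lock<std::mutex> lh(lock);
            if (!nonEmpty.wait_for(lh, timeout,
                                   [this] { return !tasks.empty(); })) {
                return false;
            }
            task = std::move(tasks.front());
            tasks.pop_front();
            return true;
        }
    };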


Thread 210 (Thread 0x7fed0f7fe700 (LWP 661)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed740e8910)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed740667e0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 201 (Thread 0x7fed0ffff700 (LWP 644)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed74135070)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed74050ef0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed0ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 192 (Thread 0x7fed2cff9700 (LWP 627)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7c90)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c078340) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2cff9700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 183 (Thread 0x7fed2d7fa700 (LWP 610)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009e000)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5009dfe0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2d7fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 174 (Thread 0x7fed2dffb700 (LWP 593)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5009dc30)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed50031010) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2dffb700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 165 (Thread 0x7fed2f7fe700 (LWP 576)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed481cef20)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed480921c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2f7fe700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 147 (Thread 0x7fed2effd700 (LWP 541)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed540015d0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54057b80) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2effd700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 138 (Thread 0x7fed6df89700 (LWP 523)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed78092aa0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78056ea0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6df89700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 120 (Thread 0x7fed2ffff700 (LWP 489)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed7c1b7d10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed7c1b7ac0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed2ffff700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 111 (Thread 0x7fed6cf87700 (LWP 472)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed5008c030)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500adf50) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6cf87700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread 102 (Thread 0x7fed6d788700 (LWP 455)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080450)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54091560) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6d788700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 93 (Thread 0x7fed6ff8d700 (LWP 438)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed54080ad0)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed54068db0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ff8d700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 57 (Thread 0x7fed6e78a700 (LWP 370)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50080230)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed5008c360) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6e78a700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 48 (Thread 0x7fed6ef8b700 (LWP 352)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed50000c10)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed500815b0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6ef8b700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 39 (Thread 0x7fed6f78c700 (LWP 334)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fed8cf43c34 in cb_cond_wait (cond=0x7fed896a2648 <bucket_engine+840>, mutex=0x7fed896a25f0 <bucket_engine+752>)
    at /couchbase/platform/src/cb_pthreads.c:119
#2 0x00007fed89498c13 in engine_shutdown_thread (arg=0x7fed4807c290)
    at /couchbase/memcached/engines/bucket_engine/bucket_engine.c:1610
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed4806e4c0) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed6f78c700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
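
Each of the engine_shutdown_thread stacks above (Threads 210 through 39) is blocked in cb_cond_wait on the same bucket_engine condition variable and mutex (note the identical cond=<bucket_engine+840>, mutex=<bucket_engine+752> addresses), which suggests a set of per-bucket shutdown threads waiting to be signalled before their bucket can be torn down. A minimal sketch of that wait pattern, under the assumption that shutdown drains outstanding users before destroying the engine instance (all names here are illustrative, not the real bucket_engine symbols):

    // Simplified illustration only -- not the actual bucket_engine code.
    #include <condition_variable>
    #include <mutex>

    struct BucketSketch {
        int refcount = 0;          // clients/operations still using the bucket
    };

    std::mutex engineLock;              // shared across buckets, like bucket_engine+752
    std::condition_variable engineCond; // shared across buckets, like bucket_engine+840

    void engineShutdownThreadSketch(BucketSketch &bucket) {
        std::unique_lock<std::mutex> lh(engineLock);
        // Parked here until a releasing user signals engineCond; with many
        // buckets shutting down this appears as many near-identical stacks.
        engineCond.wait(lh, [&bucket] { return bucket.refcount == 0; });
        // ...destroy the underlying engine instance once fully drained...
    }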


Thread 13 (Thread 0x7fed817fa700 (LWP 292)):
#0 0x00007fed8b693d7d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fed8b6c5334 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2 0x00007fed88386dd2 in updateStatsThread (arg=0x7fed780343f0) at /couchbase/ep-engine/src/memory_tracker.cc:36
#3 0x00007fed8cf43963 in platform_thread_wrap (arg=0x7fed78034450) at /couchbase/platform/src/cb_pthreads.c:19
#4 0x00007fed8beba182 in start_thread (arg=0x7fed817fa700) at pthread_create.c:312
#5 0x00007fed8b6cd30d in clone () at .