[MB-12200] Seg fault during indexing on view-toy build testing Created: 16/Sep/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Harsha Havanur
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: - 3.0.0-700-hhs-toy build
- CentOS 64-bit machines
- 7-node cluster, 2 buckets, 2 views

Attachments: Zip Archive 10.6.2.168-9162014-106-diag.zip     Zip Archive 10.6.2.187-9162014-1010-diag.zip     File crash_beam.smp.rtf     File crash_toybuild.rtf    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
1. Load 70M and 100M items on the two buckets respectively
2. Wait for initial indexing to complete
3. Start updates on the cluster: 1K gets and 7K sets across the cluster

Seeing numerous core dumps from beam.smp.

The stack trace is attached.

Adding logs from the nodes.


 Comments   
Comment by Sriram Melkote [ 16/Sep/14 ]
Harsha, this clearly appears to be a NIF-related regression. Once you figure out the problem, we need to discuss why our own testing didn't find this.




[MB-12199] curl -H arguments need to use double quotes Created: 16/Sep/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.0, 2.5.1, 3.0.1, 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Matt Ingenthron Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The current documentation states:

Indicates that an HTTP PUT operation is requested.
-H 'Content-Type: application/json'

That example fails, seemingly owing to the single quotes (the Windows command prompt, for example, does not treat single quotes as quoting characters). See also:
https://twitter.com/RamSharp/status/511739806528077824
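
A double-quoted form of the header works in both Unix shells and the Windows command prompt. This is only an illustrative sketch; the endpoint, design document name, and payload file below are placeholders, not the exact command from the documentation:

curl -X PUT -H "Content-Type: application/json" -d @ddoc.json http://localhost:8092/default/_design/dev_example   # placeholder endpoint and payload file

Passing the JSON body from a file with -d @ddoc.json also sidesteps shell quoting of the payload itself.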





[MB-12197] [Windows]: Bucket deletion failing with error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue."} Created: 16/Sep/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Meenakshi Goel Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.1-1299-rel

Attachments: Text File test.txt    
Triage: Triaged
Operating System: Windows 64-bit
Is this a Regression?: Yes

 Description   
Jenkins Ref Link:
http://qa.hq.northscale.net/job/win_2008_x64--14_01--replica_read-P0/32/consoleFull
http://qa.hq.northscale.net/job/win_2008_x64--59--01--bucket_flush-P1/14/console
http://qa.hq.northscale.net/job/win_2008_x64--59_01--warmup-P1/6/consoleFull

Test to Reproduce:
newmemcapable.GetrTests.getr_test,nodes_init=4,GROUP=P0,expiration=60,wait_expiration=true,error=Not found for vbucket,descr=#simple getr replica_count=1 expiration=60 flags = 0 docs_ops=create cluster ops = None
flush.bucketflush.BucketFlushTests.bucketflush,items=20000,nodes_in=3,GROUP=P0

*Note that this test doesn't fail, but subsequent tests fail with "error 400 reason: unknown ["Prepare join failed. Node is already part of cluster."]" because the cleanup wasn't successful.
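
One way to avoid the follow-on "Prepare join failed" error is to wait until the bucket has actually disappeared from the REST bucket list before the next test re-joins nodes. This is only a hedged sketch: the node address is taken from the logs below, while the credentials and the polling loop are illustrative; only the /pools/default/buckets endpoint itself is assumed.

# Poll until the "default" bucket is no longer listed, then continue with cleanup/re-join.
while curl -s -u Administrator:password http://10.3.121.182:8091/pools/default/buckets | grep -q '"name":"default"'; do sleep 5; done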

Logs:
[rebalance:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.6938.0>:ns_rebalancer:do_wait_buckets_shutdown:307]Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.3.121.182',
                                                         {'EXIT',
                                                          {old_buckets_shutdown_wait_failed,
                                                           ["default"]}}}]

[error_logger:error,2014-09-15T9:36:01.989,ns_1@10.3.121.182:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.6938.0>
    registered_name: []
    exception exit: {buckets_shutdown_wait_failed,
                        [{'ns_1@10.3.121.182',
                             {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                     ["default"]}}}]}
      in function ns_rebalancer:do_wait_buckets_shutdown/1 (src/ns_rebalancer.erl, line 308)
      in call from ns_rebalancer:rebalance/5 (src/ns_rebalancer.erl, line 361)
    ancestors: [<0.811.0>,mb_master_sup,mb_master,ns_server_sup,
                  ns_server_cluster_sup,<0.57.0>]
    messages: []
    links: [<0.811.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 46422
    stack_size: 27
    reductions: 5472
  neighbours:

[user:info,2014-09-15T9:36:01.989,ns_1@10.3.121.182:<0.811.0>:ns_orchestrator:handle_info:483]Rebalance exited with reason {buckets_shutdown_wait_failed,
                              [{'ns_1@10.3.121.182',
                                {'EXIT',
                                 {old_buckets_shutdown_wait_failed,
                                  ["default"]}}}]}
[ns_server:error,2014-09-15T9:36:09.645,ns_1@10.3.121.182:ns_memcached-default<0.4908.0>:ns_memcached:terminate:798]Failed to delete bucket "default": {error,{badmatch,{error,closed}}}

Uploading Logs

 Comments   
Comment by Meenakshi Goel [ 16/Sep/14 ]
https://s3.amazonaws.com/bugdb/jira/MB-12197/11dd43ca/10.3.121.182-9152014-938-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/e7795065/10.3.121.183-9152014-940-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/6442301b/10.3.121.102-9152014-942-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/10edf209/10.3.121.107-9152014-943-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-12197/9f16f503/10.1.2.66-9152014-945-diag.zip




[MB-12196] [Windows] When I run cbworkloadgen.exe, I see a Warning message Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build 1299

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install 3.0.1_1299 build
Go to the bin directory under the installation directory and run cbworkloadgen.exe
You will see the following warning:
WARNING:root:could not import snappy module. Compress/uncompress function will be skipped.

Expected behavior: The above warning should not appear





[MB-12195] Update notifications does not seem to be working Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Ian McCloy
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Centos 5.8
2.5.0

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I have installed the 2.5.0 build and enabled Update Notifications.
Even though I enabled "Enable software Update Notifications", I keep getting "No Updates available".
I thought I would be notified in the UI that 2.5.1 is available.

I consulted Tony to see if I had done something wrong, but he confirmed that this seems to be an issue and is a bug.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
Based on dev tools, we're getting "no new version" from the phone-home requests. So it's not a UI bug.




[MB-12194] [Windows] When you try to uninstall CB server it comes up with Installer wizard instead of uninstall Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Raju Suravarjjala Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7
Build: 3.0.1_1299

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Install the Windows 3.0.1_1299 build.
Try to uninstall the Couchbase Server.
You will see the Couchbase InstallShield Installation Wizard, which then prompts to remove the selected application and all of its features.

Expected result: It would be nice to show an Uninstall Wizard instead of the confusing Installation Wizard.




[MB-12193] Docs should explicitly state that we don't support online downgrades in the installation guide Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Gokul Krishnan Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
In the installation guide, we should call out the fact that online downgrades (from 3.0 to 2.5.1) aren't supported and that downgrades require servers to be taken offline.

 Comments   
Comment by Ruth Harris [ 15/Sep/14 ]
In the 3.0 documentation:

Upgrading >
<note type="important">Online downgrades from 3.0 to 2.5.1 are not supported. Downgrades require that servers be taken offline.</note>

Should this be in the release notes too?
Comment by Matt Ingenthron [ 15/Sep/14 ]
"online" or "any"?




[MB-12192] XDCR : After warmup, replica items are not deleted in destination cluster Created: 15/Sep/14  Updated: 16/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket, DCP
Affects Version/s: 3.0.1
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aruna Piravi Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.x, 3.0.1-1297-rel

Attachments: Zip Archive 172.23.106.45-9152014-1553-diag.zip     GZip Archive 172.23.106.45-9152014-1623-couch.tar.gz     Zip Archive 172.23.106.46-9152014-1555-diag.zip     GZip Archive 172.23.106.46-9152014-1624-couch.tar.gz     Zip Archive 172.23.106.47-9152014-1558-diag.zip     GZip Archive 172.23.106.47-9152014-1624-couch.tar.gz     Zip Archive 172.23.106.48-9152014-160-diag.zip     GZip Archive 172.23.106.48-9152014-1624-couch.tar.gz    
Triage: Untriaged
Is this a Regression?: Yes

 Description   
Steps
--------
1. Set up uni-directional XDCR between 2 clusters with at least 2 nodes each
2. Load 5000 items onto each of 3 buckets at the source; they get replicated to the destination
3. Reboot a non-master node on the destination (in this test .48)
4. After warmup, perform 30% updates and 30% deletes on the source cluster
5. Deletes get propagated to the active vbuckets on the destination, but the replica vbuckets only see partial deletion

Important note
--------------------
This test had passed on 3.0.0-1208-rel and 3.0.0-1209-rel. However I'm able to reproduce this consistently on 3.0.1. Unsure if this is a recent regression.

2014-09-15 14:43:50 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', sasl_bucket_1 bucket
2014-09-15 14:43:51 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', standard_bucket_1 bucket
2014-09-15 14:43:51 | WARNING | MainProcess | Cluster_Thread | [task.check] Not Ready: vb_replica_curr_items 4250 == 3500 expected on '172.23.106.47:8091''172.23.106.48:8091', default bucket

On destination cluster
-----------------------------

Arunas-MacBook-Pro:bin apiravi$ ./cbvdiff 172.23.106.47:11210,172.23.106.48:11210
VBucket 512: active count 4 != 6 replica count

VBucket 513: active count 2 != 4 replica count

VBucket 514: active count 8 != 11 replica count

VBucket 515: active count 3 != 4 replica count

VBucket 516: active count 8 != 10 replica count

VBucket 517: active count 5 != 6 replica count

VBucket 521: active count 0 != 1 replica count

VBucket 522: active count 7 != 11 replica count

VBucket 523: active count 3 != 5 replica count

VBucket 524: active count 6 != 10 replica count

VBucket 525: active count 4 != 6 replica count

VBucket 526: active count 4 != 6 replica count

VBucket 528: active count 7 != 10 replica count

VBucket 529: active count 3 != 4 replica count

VBucket 530: active count 3 != 4 replica count

VBucket 532: active count 0 != 2 replica count

VBucket 533: active count 1 != 2 replica count

VBucket 534: active count 8 != 10 replica count

VBucket 535: active count 5 != 6 replica count

VBucket 536: active count 7 != 11 replica count

VBucket 537: active count 3 != 5 replica count

VBucket 540: active count 3 != 4 replica count

VBucket 542: active count 6 != 10 replica count

VBucket 543: active count 4 != 6 replica count

VBucket 544: active count 6 != 10 replica count

VBucket 545: active count 3 != 4 replica count

VBucket 547: active count 0 != 1 replica count

VBucket 548: active count 6 != 7 replica count

VBucket 550: active count 7 != 10 replica count

VBucket 551: active count 4 != 5 replica count

VBucket 552: active count 9 != 11 replica count

VBucket 553: active count 4 != 6 replica count

VBucket 554: active count 4 != 5 replica count

VBucket 555: active count 1 != 2 replica count

VBucket 558: active count 7 != 10 replica count

VBucket 559: active count 3 != 4 replica count

VBucket 562: active count 6 != 10 replica count

VBucket 563: active count 4 != 5 replica count

VBucket 564: active count 7 != 10 replica count

VBucket 565: active count 4 != 5 replica count

VBucket 566: active count 4 != 5 replica count

VBucket 568: active count 3 != 4 replica count

VBucket 570: active count 8 != 10 replica count

VBucket 571: active count 4 != 6 replica count

VBucket 572: active count 7 != 10 replica count

VBucket 573: active count 3 != 4 replica count

VBucket 574: active count 0 != 1 replica count

VBucket 575: active count 0 != 1 replica count

VBucket 578: active count 8 != 10 replica count

VBucket 579: active count 4 != 6 replica count

VBucket 580: active count 8 != 11 replica count

VBucket 581: active count 3 != 4 replica count

VBucket 582: active count 3 != 4 replica count

VBucket 583: active count 1 != 2 replica count

VBucket 584: active count 3 != 4 replica count

VBucket 586: active count 6 != 10 replica count

VBucket 587: active count 3 != 4 replica count

VBucket 588: active count 7 != 10 replica count

VBucket 589: active count 4 != 5 replica count

VBucket 591: active count 0 != 2 replica count

VBucket 592: active count 8 != 10 replica count

VBucket 593: active count 4 != 6 replica count

VBucket 594: active count 0 != 1 replica count

VBucket 595: active count 0 != 1 replica count

VBucket 596: active count 4 != 6 replica count

VBucket 598: active count 7 != 10 replica count

VBucket 599: active count 3 != 4 replica count

VBucket 600: active count 6 != 10 replica count

VBucket 601: active count 3 != 4 replica count

VBucket 602: active count 4 != 6 replica count

VBucket 606: active count 7 != 10 replica count

VBucket 607: active count 4 != 5 replica count

VBucket 608: active count 7 != 11 replica count

VBucket 609: active count 3 != 5 replica count

VBucket 610: active count 3 != 4 replica count

VBucket 613: active count 0 != 1 replica count

VBucket 614: active count 6 != 10 replica count

VBucket 615: active count 4 != 6 replica count

VBucket 616: active count 7 != 10 replica count

VBucket 617: active count 3 != 4 replica count

VBucket 620: active count 3 != 4 replica count

VBucket 621: active count 1 != 2 replica count

VBucket 622: active count 9 != 11 replica count

VBucket 623: active count 5 != 6 replica count

VBucket 624: active count 5 != 6 replica count

VBucket 626: active count 7 != 11 replica count

VBucket 627: active count 3 != 5 replica count

VBucket 628: active count 6 != 10 replica count

VBucket 629: active count 4 != 6 replica count

VBucket 632: active count 0 != 1 replica count

VBucket 633: active count 0 != 1 replica count

VBucket 634: active count 7 != 10 replica count

VBucket 635: active count 3 != 4 replica count

VBucket 636: active count 8 != 10 replica count

VBucket 637: active count 5 != 6 replica count

VBucket 638: active count 5 != 6 replica count

VBucket 640: active count 2 != 4 replica count

VBucket 641: active count 7 != 11 replica count

VBucket 643: active count 5 != 7 replica count

VBucket 646: active count 3 != 5 replica count

VBucket 647: active count 7 != 10 replica count

VBucket 648: active count 4 != 6 replica count

VBucket 649: active count 8 != 10 replica count

VBucket 651: active count 0 != 1 replica count

VBucket 653: active count 4 != 6 replica count

VBucket 654: active count 3 != 4 replica count

VBucket 655: active count 7 != 10 replica count

VBucket 657: active count 4 != 5 replica count

VBucket 658: active count 2 != 4 replica count

VBucket 659: active count 7 != 11 replica count

VBucket 660: active count 3 != 5 replica count

VBucket 661: active count 7 != 10 replica count

VBucket 662: active count 0 != 2 replica count

VBucket 666: active count 4 != 6 replica count

VBucket 667: active count 8 != 10 replica count

VBucket 668: active count 3 != 4 replica count

VBucket 669: active count 7 != 10 replica count

VBucket 670: active count 1 != 2 replica count

VBucket 671: active count 2 != 3 replica count

VBucket 673: active count 0 != 1 replica count

VBucket 674: active count 3 != 4 replica count

VBucket 675: active count 7 != 10 replica count

VBucket 676: active count 5 != 6 replica count

VBucket 677: active count 8 != 10 replica count

VBucket 679: active count 5 != 6 replica count

VBucket 681: active count 6 != 7 replica count

VBucket 682: active count 3 != 5 replica count

VBucket 683: active count 8 != 12 replica count

VBucket 684: active count 3 != 6 replica count

VBucket 685: active count 7 != 11 replica count

VBucket 688: active count 3 != 4 replica count

VBucket 689: active count 7 != 10 replica count

VBucket 692: active count 1 != 2 replica count

VBucket 693: active count 2 != 3 replica count

VBucket 694: active count 5 != 6 replica count

VBucket 695: active count 8 != 10 replica count

VBucket 696: active count 3 != 5 replica count

VBucket 697: active count 8 != 12 replica count

VBucket 699: active count 4 != 5 replica count

VBucket 700: active count 0 != 1 replica count

VBucket 702: active count 3 != 6 replica count

VBucket 703: active count 7 != 11 replica count

VBucket 704: active count 3 != 5 replica count

VBucket 705: active count 8 != 12 replica count

VBucket 709: active count 4 != 5 replica count

VBucket 710: active count 3 != 6 replica count

VBucket 711: active count 7 != 11 replica count

VBucket 712: active count 3 != 4 replica count

VBucket 713: active count 7 != 10 replica count

VBucket 715: active count 3 != 4 replica count

VBucket 716: active count 1 != 2 replica count

VBucket 717: active count 0 != 2 replica count

VBucket 718: active count 5 != 6 replica count

VBucket 719: active count 8 != 10 replica count

VBucket 720: active count 0 != 1 replica count

VBucket 722: active count 3 != 5 replica count

VBucket 723: active count 8 != 12 replica count

VBucket 724: active count 3 != 6 replica count

VBucket 725: active count 7 != 11 replica count

VBucket 727: active count 5 != 7 replica count

VBucket 728: active count 2 != 4 replica count

VBucket 729: active count 3 != 5 replica count

VBucket 730: active count 3 != 4 replica count

VBucket 731: active count 7 != 10 replica count

VBucket 732: active count 5 != 6 replica count

VBucket 733: active count 8 != 10 replica count

VBucket 737: active count 3 != 4 replica count

VBucket 738: active count 4 != 6 replica count

VBucket 739: active count 8 != 10 replica count

VBucket 740: active count 3 != 4 replica count

VBucket 741: active count 7 != 10 replica count

VBucket 743: active count 0 != 1 replica count

VBucket 746: active count 2 != 4 replica count

VBucket 747: active count 7 != 11 replica count

VBucket 748: active count 3 != 5 replica count

VBucket 749: active count 7 != 10 replica count

VBucket 751: active count 3 != 4 replica count

VBucket 752: active count 4 != 6 replica count

VBucket 753: active count 9 != 11 replica count

VBucket 754: active count 1 != 2 replica count

VBucket 755: active count 4 != 5 replica count

VBucket 758: active count 3 != 4 replica count

VBucket 759: active count 7 != 10 replica count

VBucket 760: active count 2 != 4 replica count

VBucket 761: active count 7 != 11 replica count

VBucket 762: active count 0 != 1 replica count

VBucket 765: active count 6 != 7 replica count

VBucket 766: active count 3 != 5 replica count

VBucket 767: active count 7 != 10 replica count

VBucket 770: active count 3 != 5 replica count

VBucket 771: active count 7 != 11 replica count

VBucket 772: active count 4 != 6 replica count

VBucket 773: active count 6 != 10 replica count

VBucket 775: active count 3 != 4 replica count

VBucket 777: active count 3 != 4 replica count

VBucket 778: active count 3 != 4 replica count

VBucket 779: active count 7 != 10 replica count

VBucket 780: active count 5 != 6 replica count

VBucket 781: active count 8 != 10 replica count

VBucket 782: active count 1 != 2 replica count

VBucket 783: active count 0 != 2 replica count

VBucket 784: active count 3 != 5 replica count

VBucket 785: active count 7 != 11 replica count

VBucket 786: active count 0 != 1 replica count

VBucket 789: active count 4 != 6 replica count

VBucket 790: active count 4 != 6 replica count

VBucket 791: active count 6 != 10 replica count

VBucket 792: active count 3 != 4 replica count

VBucket 793: active count 8 != 11 replica count

VBucket 794: active count 2 != 4 replica count

VBucket 795: active count 4 != 6 replica count

VBucket 798: active count 5 != 6 replica count

VBucket 799: active count 8 != 10 replica count

VBucket 800: active count 4 != 6 replica count

VBucket 801: active count 8 != 10 replica count

VBucket 803: active count 3 != 4 replica count

VBucket 804: active count 0 != 1 replica count

VBucket 805: active count 0 != 1 replica count

VBucket 806: active count 3 != 4 replica count

VBucket 807: active count 7 != 10 replica count

VBucket 808: active count 3 != 4 replica count

VBucket 809: active count 6 != 10 replica count

VBucket 813: active count 4 != 5 replica count

VBucket 814: active count 4 != 5 replica count

VBucket 815: active count 7 != 10 replica count

VBucket 816: active count 1 != 2 replica count

VBucket 817: active count 4 != 5 replica count

VBucket 818: active count 4 != 6 replica count

VBucket 819: active count 8 != 10 replica count

VBucket 820: active count 3 != 4 replica count

VBucket 821: active count 7 != 10 replica count

VBucket 824: active count 0 != 1 replica count

VBucket 826: active count 3 != 4 replica count

VBucket 827: active count 6 != 10 replica count

VBucket 828: active count 4 != 5 replica count

VBucket 829: active count 7 != 10 replica count

VBucket 831: active count 6 != 7 replica count

VBucket 833: active count 4 != 6 replica count

VBucket 834: active count 3 != 4 replica count

VBucket 835: active count 6 != 10 replica count

VBucket 836: active count 4 != 5 replica count

VBucket 837: active count 7 != 10 replica count

VBucket 840: active count 0 != 1 replica count

VBucket 841: active count 0 != 1 replica count

VBucket 842: active count 4 != 6 replica count

VBucket 843: active count 8 != 10 replica count

VBucket 844: active count 3 != 4 replica count

VBucket 845: active count 7 != 10 replica count

VBucket 847: active count 4 != 6 replica count

VBucket 848: active count 3 != 4 replica count

VBucket 849: active count 6 != 10 replica count

VBucket 851: active count 3 != 4 replica count

VBucket 852: active count 0 != 2 replica count

VBucket 854: active count 4 != 5 replica count

VBucket 855: active count 7 != 10 replica count

VBucket 856: active count 4 != 6 replica count

VBucket 857: active count 8 != 10 replica count

VBucket 860: active count 1 != 2 replica count

VBucket 861: active count 3 != 4 replica count

VBucket 862: active count 3 != 4 replica count

VBucket 863: active count 8 != 11 replica count

VBucket 864: active count 3 != 4 replica count

VBucket 865: active count 7 != 10 replica count

VBucket 866: active count 0 != 1 replica count

VBucket 867: active count 0 != 1 replica count

VBucket 869: active count 5 != 6 replica count

VBucket 870: active count 5 != 6 replica count

VBucket 871: active count 8 != 10 replica count

VBucket 872: active count 3 != 5 replica count

VBucket 873: active count 7 != 11 replica count

VBucket 875: active count 5 != 6 replica count

VBucket 878: active count 4 != 6 replica count

VBucket 879: active count 6 != 10 replica count

VBucket 882: active count 3 != 4 replica count

VBucket 883: active count 7 != 10 replica count

VBucket 884: active count 5 != 6 replica count

VBucket 885: active count 9 != 11 replica count

VBucket 886: active count 1 != 2 replica count

VBucket 887: active count 3 != 4 replica count

VBucket 889: active count 3 != 4 replica count

VBucket 890: active count 3 != 5 replica count

VBucket 891: active count 7 != 11 replica count

VBucket 892: active count 4 != 6 replica count

VBucket 893: active count 6 != 10 replica count

VBucket 894: active count 0 != 1 replica count

VBucket 896: active count 8 != 10 replica count

VBucket 897: active count 4 != 6 replica count

VBucket 900: active count 2 != 3 replica count

VBucket 901: active count 2 != 3 replica count

VBucket 902: active count 7 != 10 replica count

VBucket 903: active count 3 != 4 replica count

VBucket 904: active count 7 != 11 replica count

VBucket 905: active count 2 != 4 replica count

VBucket 906: active count 4 != 5 replica count

VBucket 909: active count 0 != 2 replica count

VBucket 910: active count 7 != 10 replica count

VBucket 911: active count 3 != 5 replica count

VBucket 912: active count 0 != 1 replica count

VBucket 914: active count 8 != 10 replica count

VBucket 915: active count 4 != 6 replica count

VBucket 916: active count 7 != 10 replica count

VBucket 917: active count 3 != 4 replica count

VBucket 918: active count 4 != 6 replica count

VBucket 920: active count 5 != 7 replica count

VBucket 922: active count 7 != 11 replica count

VBucket 923: active count 2 != 4 replica count

VBucket 924: active count 7 != 10 replica count

VBucket 925: active count 3 != 5 replica count

VBucket 928: active count 4 != 5 replica count

VBucket 930: active count 8 != 12 replica count

VBucket 931: active count 3 != 5 replica count

VBucket 932: active count 7 != 11 replica count

VBucket 933: active count 3 != 6 replica count

VBucket 935: active count 0 != 1 replica count

VBucket 938: active count 7 != 10 replica count

VBucket 939: active count 3 != 4 replica count

VBucket 940: active count 8 != 10 replica count

VBucket 941: active count 5 != 6 replica count

VBucket 942: active count 2 != 3 replica count

VBucket 943: active count 1 != 2 replica count

VBucket 944: active count 8 != 12 replica count

VBucket 945: active count 3 != 5 replica count

VBucket 946: active count 6 != 7 replica count

VBucket 950: active count 7 != 11 replica count

VBucket 951: active count 3 != 6 replica count

VBucket 952: active count 7 != 10 replica count

VBucket 953: active count 3 != 4 replica count

VBucket 954: active count 0 != 1 replica count

VBucket 956: active count 5 != 6 replica count

VBucket 958: active count 8 != 10 replica count

VBucket 959: active count 5 != 6 replica count

VBucket 960: active count 7 != 10 replica count

VBucket 961: active count 3 != 4 replica count

VBucket 962: active count 3 != 5 replica count

VBucket 963: active count 2 != 4 replica count

VBucket 966: active count 8 != 10 replica count

VBucket 967: active count 5 != 6 replica count

VBucket 968: active count 8 != 12 replica count

VBucket 969: active count 3 != 5 replica count

VBucket 971: active count 0 != 1 replica count

VBucket 972: active count 5 != 7 replica count

VBucket 974: active count 7 != 11 replica count

VBucket 975: active count 3 != 6 replica count

VBucket 976: active count 3 != 4 replica count

VBucket 978: active count 7 != 10 replica count

VBucket 979: active count 3 != 4 replica count

VBucket 980: active count 8 != 10 replica count

VBucket 981: active count 5 != 6 replica count

VBucket 982: active count 0 != 2 replica count

VBucket 983: active count 1 != 2 replica count

VBucket 986: active count 8 != 12 replica count

VBucket 987: active count 3 != 5 replica count

VBucket 988: active count 7 != 11 replica count

VBucket 989: active count 3 != 6 replica count

VBucket 990: active count 4 != 5 replica count

VBucket 993: active count 0 != 1 replica count

VBucket 994: active count 7 != 11 replica count

VBucket 995: active count 2 != 4 replica count

VBucket 996: active count 7 != 10 replica count

VBucket 997: active count 3 != 5 replica count

VBucket 998: active count 5 != 6 replica count

VBucket 1000: active count 4 != 5 replica count

VBucket 1001: active count 1 != 2 replica count

VBucket 1002: active count 9 != 11 replica count

VBucket 1003: active count 4 != 6 replica count

VBucket 1004: active count 7 != 10 replica count

VBucket 1005: active count 3 != 4 replica count

VBucket 1008: active count 7 != 11 replica count

VBucket 1009: active count 2 != 4 replica count

VBucket 1012: active count 4 != 5 replica count

VBucket 1014: active count 7 != 10 replica count

VBucket 1015: active count 3 != 5 replica count

VBucket 1016: active count 8 != 10 replica count

VBucket 1017: active count 4 != 6 replica count

VBucket 1018: active count 3 != 4 replica count

VBucket 1020: active count 0 != 1 replica count

VBucket 1022: active count 7 != 10 replica count

VBucket 1023: active count 3 != 4 replica count

Active item count = 3500

Same at source
----------------------
Arunas-MacBook-Pro:bin apiravi$ ./cbvdiff 172.23.106.45:11210,172.23.106.46:11210
Active item count = 3500

Will attach cbcollect and data files.


 Comments   
Comment by Mike Wiederhold [ 15/Sep/14 ]
This is not a bug. We no longer do this because a replica vbucket cannot delete items on its own due to DCP.
Comment by Aruna Piravi [ 15/Sep/14 ]
I do not understand why this is not a bug. This is a case where replica items = 4250 and active = 3500. Both were initially 5000 before warmup. However, 50% of the actual deletes have happened on the replica vbuckets (5000 -> 4250), so I would expect the other 750 items to be deleted too, so that active = replica. If this is not a bug, then in case of failover the cluster will end up having more items than it did before the failover.
Comment by Aruna Piravi [ 15/Sep/14 ]
> We no longer do this because a replica vbucket cannot delete items on its own due to DCP
Then I would expect the deletes to be propagated from the active vbuckets through DCP... but these never get propagated. If you run cbvdiff even now, you can see the mismatch.




[MB-12191] forestdb needs an fdb_destroy() api to clean up a db Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: forestdb
Affects Version/s: feature-backlog
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Sundar Sridharan Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Triaged
Is this a Regression?: Unknown

 Description   
forestdb does not have an option to clean up a database.
Manual deletion of the database files after fdb_close() and fdb_shutdown() is the current workaround.
An fdb_destroy() API needs to be added that erases all forestdb files cleanly.
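
For reference, a minimal sketch of the manual workaround mentioned above, assuming the database was opened with filename /data/mydb and that compacted revisions sit next to it with a numeric suffix (the path and naming pattern are illustrative, not part of the forestdb API):

# Only safe after fdb_close() and fdb_shutdown() have returned.
rm -f /data/mydb /data/mydb.*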




[MB-12190] Typo in the output of couchbase-cli bucket-flush Created: 15/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: tools
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Patrick Varley Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: cli
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
There should be a space between the full stop and Do.

[patrick:~] 2 $ couchbase-cli bucket-flush -b Test -c localhost
Running this command will totally PURGE database data from disk.Do you really want to do it? (Yes/No)

Another typo appears when the command times out:

Running this command will totally PURGE database data from disk.Do you really want to do it? (Yes/No)TIMED OUT: command: bucket-flush: localhost:8091, most likely bucket is not flushed





[MB-12189] XDCR REST API "max-concurrency" only works for 1 of 3 documented end-points. Created: 15/Sep/14  Updated: 16/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, RESTful-APIs
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Jim Walker Assignee: Aleksey Kondratenko
Resolution: Done Votes: 0
Labels: xdcr
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Couchbase Server 2.5.1
RHEL 6.4
VM (VirtualBox)
1 node "cluster"

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
This defect relates to the following REST APIs:

* xdcrMaxConcurrentReps (default 32) http://localhost:8091/internalSettings/
* maxConcurrentReps (default 32) http://localhost:8091/settings/replications/
* maxConcurrentReps (default 32) http://localhost:8091/settings/replications/<replication_id>

The documentation suggests these all do the same thing, but with the scope of change being different.

<docs>
/settings/replications/ — global settings applied to all replications for a cluster
settings/replications/<replication_id> — settings for specific replication for a bucket
/internalSettings - settings applied to all replications for a cluster. Endpoint exists in Couchbase 2.0 and onward.
</docs>

This defect is because only "settings/replications/<replication_id>" has any effect. The other REST endpoints have no effect.

Out of these APIs I can confirm that changing "/settings/replications/<replication_id>" has an effect. The XDCR code shows that the concurrent reps setting feeds into the concurrency throttle as the number of available tokens. I use the xdcr log files, where we print the concurrency throttle token data, to observe that the setting has an effect.

For example, a cluster in the default configuration has a total tokens of 32. We can grep to see this.

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.*
2014-09-15T13:09:03.886,ns_1@127.0.0.1:<0.32370.0>:concurrency_throttle:clean_concurr_throttle_state:275]rep <0.33.1> to node "192.168.69.102:8092" is done normally, total tokens: 32, available tokens: 32,(active reps: 0, waiting reps: 0)

Now, changing the setting to 42, the log file shows the change taking effect.

curl -u Administrator:password http://localhost:8091/settings/replications/01d38792865ba2d624edb4b2ad2bf07f%2fdefault%2fdefault -d maxConcurrentReps=42

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.*
dcr.1:[xdcr:debug,2014-09-15T13:17:41.112,ns_1@127.0.0.1:<0.32370.0>:concurrency_throttle:clean_concurr_throttle_state:275]rep <0.2321.1> to node "192.168.69.102:8092" is done normally, total tokens: 42, available tokens: 42,(active reps: 0, waiting reps: 0)

Since this defect is that the other two REST end-points don't appear to have any effect, here's an example changing "/settings/replications/". This example was run on a clean cluster, i.e. no other settings had been changed; only bucket and replication creation plus client writes had been performed.

root@localhost logs]# curl -u Administrator:password http://localhost:8091/settings/replications/ -d maxConcurrentReps=48
{"maxConcurrentReps":48,"checkpointInterval":1800,"docBatchSizeKb":2048,"failureRestartInterval":30,"workerBatchSize":500,"connectionTimeout":180,"workerProcesses":4,"httpConnections":20,"retriesPerRequest":2,"optimisticReplicationThreshold":256,"socketOptions":{"keepalive":true,"nodelay":false},"supervisorMaxR":25,"supervisorMaxT":5,"traceDumpInvprob":1000}

The above shows that the JSON response acknowledges the value of 48, but the log files show no change. After much waiting and re-checking, grep shows no evidence of the new value.

[root@localhost logs]# grep "is done normally, total tokens:" xdcr.* | grep "total tokens: 48" | wc -l
0
[root@localhost logs]# grep "is done normally, total tokens:" xdcr.* | grep "total tokens: 32" | wc -l
7713

The same was observed for /internalSettings/

Found on both 2.5.1 and 3.0.
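
For reference, the current values can be read back over REST to see which scope a given setting is coming from. This is a hedged sketch: GET is assumed to be supported on these endpoints, and the replication id is the one used in the examples above.

curl -u Administrator:password http://localhost:8091/settings/replications/
curl -u Administrator:password http://localhost:8091/settings/replications/01d38792865ba2d624edb4b2ad2bf07f%2fdefault%2fdefault
# If the second response lists maxConcurrentReps, that per-replication value overrides the global one for that replication.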

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
This is because global settings affect only new replications or replications without per-replication settings defined. The UI always defines all per-replication settings.
Comment by Jim Walker [ 16/Sep/14 ]
Have you pushed a documentation update for this?
Comment by Aleksey Kondratenko [ 16/Sep/14 ]
No. I don't own docs.




[MB-12188] we should not duplicate log messages if we already have logs with "repeated n times" template Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MB-12188.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Please see the screenshot.

I think the log entries without "repeated n times" are unnecessary.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
They _are_. The logic (and it's the same logic many logging products have) is: _if_ within a short period of time (say 5 minutes) you get a bunch of the same messages, they'll be logged once. But if the periods between messages are larger, then they're logged separately.




[MB-12187] Webinterface is not displaying items above 2.5kb in size Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Philipp Fehre Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: MacOS, Webinterface

Attachments: PNG File document_size_couchbase.png    

 Description   
When trying to display a document that is above 2.5KB, the web interface will block the display. 2.5KB seems like a really low limit and is easily reached by regular documents, which makes using the web interface inefficient, especially when a bucket contains many documents close to this limit.
It makes sense to have a limit so that really big documents aren't loaded into the interface, but 2.5KB seems like a really low threshold.

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
By design. Older browsers have trouble with larger docs. And there must be a duplicate of this somewhere.




[MB-12186] If flush can not be completed because of a timeout, we should not display a message "Failed to flush bucket" when it's still in progress Created: 15/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server, UI
Affects Version/s: 3.0.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Andrei Baranouski Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1208

Attachments: PNG File MB-12186.png    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
When I tried to flush a heavily loaded cluster I received a "Failed To Flush Bucket" popup; in fact it had not failed, but simply had not completed within the set period of time (30 sec).

Expected behaviour: a message like "flush is not complete, but continuing..."

 Comments   
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
A timeout is a timeout. We can say "it timed out", but we cannot be sure whether it's continuing or not.
Comment by Andrei Baranouski [ 15/Sep/14 ]
Hm, we also get a timeout when bucket removal takes a long time, but in that case we inform the user that the removal is still in progress, right?




[MB-12185] update to "couchbase" from "membase" in gerrit mirroring and manifests Created: 14/Sep/14  Updated: 15/Sep/14

Status: Open
Project: Couchbase Server
Component/s: build
Affects Version/s: 2.5.0, 2.5.1, 3.0-Beta
Fix Version/s: 3.0
Security Level: Public

Type: Task Priority: Blocker
Reporter: Matt Ingenthron Assignee: Chris Hillery
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
One of the key components of Couchbase is still only at github.com/membase and not at github.com/couchbase. I think it's okay to mirror to both locations (not that there's an advantage), but for sure it should be at couchbase and the manifest for Couchbase Server releases should be pointing to Couchbase.

I believe the steps here are as follows:
- Set up a github.com/couchbase/memcached project (I've done that)
- Update gerrit's commit hook to update that repository
- Change the manifests to start using that repository

Assigning this to build as a component, as gerrit is handled by the build team. Then I'm guessing it'll need to be handed over to Trond or another developer to do the manifest change once gerrit is up to date.

Since memcached is slow changing now, perhaps the third item can be done earlier.

 Comments   
Comment by Chris Hillery [ 15/Sep/14 ]
Actually manifests are owned by build team too so I will do both parts.

However, the manifest for the hopefully-final release candidate already exists, and I'm a teensy bit wary about changing it after the fact. The manifest change may need to wait for 3.0.1.
Comment by Matt Ingenthron [ 15/Sep/14 ]
I'll leave it to you to work out how to fix it, but I'd just point out that manifest files are mutable.
Comment by Chris Hillery [ 15/Sep/14 ]
The manifest we build from is mutable. The historical manifests recording what we have already built really shouldn't be.
Comment by Matt Ingenthron [ 15/Sep/14 ]
True, but they are. :) That was half me calling back to our discussion about tagging and mutability of things in the Mountain View office. I'm sure you remember that late night conversation.

If you can help here Ceej, that'd be great. I'm just trying to make sure we have the cleanest project possible out there on the web. One wart less will bring me to 999,999 or so. :)
Comment by Trond Norbye [ 15/Sep/14 ]
Just an FYI, we've been ramping up the changes to memcached, so it's no longer a slow-moving component ;-)
Comment by Matt Ingenthron [ 15/Sep/14 ]
Slow moving w.r.t. 3.0.0 though, right? That means the current github.com/couchbase/memcached probably has the commit planned to be released, so it's low risk to update github.com/couchbase/manifest with the couchbase repo instead of membase.

That's all I meant. :)
Comment by Trond Norbye [ 15/Sep/14 ]
_all_ components should be slow moving with respect to 3.0.0 ;)




[MB-12184] Enable logging to a remote server Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Minor
Reporter: James Mauss Assignee: Cihan Biyikoglu
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
It would be nice to be able to configure Couchbase Server to log events to a remote syslog-ng or similar server.




[MB-12183] View Query Thruput regression compared with previous and 2.5.1 builds Created: 12/Sep/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thomas Anderson Assignee: Harsha Havanur
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4xnode cluster; 2xSSD

Issue Links:
Duplicate
duplicates MB-11917 One node slow probably due to the Erl... Open
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.29.zip
http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.30.zip
http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.31.zip
http://ci.sc.couchbase.com/job/leto/597/artifact/172.23.100.32.zip
Is this a Regression?: Yes

 Description   
Query throughput: 1 bucket, 20M x 2KB items, non-DGM, 4x1 views, 500 mutations/sec/node.
Performance on 2.5.1: 2185; on 3.0.0-1205 (RC2): 1599; on 3.0.0-1208 (RC3): 1635; on 3.0.0-1209 (RC4): 331.
That is a 92% regression vs 2.5.1 and a 72% regression vs 3.0.0-1208 (RC3).

 Comments   
Comment by Sriram Melkote [ 12/Sep/14 ]
Sarath looked at it. Data points:

- First run was fine, second run was slow
http://showfast.sc.couchbase.com/#/runs/query_thr_20M_leto_ssd/3.0.0-1209

- CPU utilization in the second run was much lower on node 31, indicative of scheduler collapse
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_ssd_300-1209_0fb_access

So this is a duplicate of MB-11917
Comment by Thomas Anderson [ 12/Sep/14 ]
After a reboot of the cluster and a rerun with the same parameters, 3.0.0-1209 now shows the same performance as the previous 3.0 builds. It is still a 25% regression relative to 2.5.1, but this is now a duplicate of MB-11917 (assigned to 3.0.1): a sporadic Erlang scheduler slowdown on one node in the cluster causing various performance and functional issues.
Comment by Thomas Anderson [ 12/Sep/14 ]
Closed as a duplicate of MB-11917, the Erlang scheduler collapse issue planned to be fixed in 3.0.1.




[MB-12182] XDCR@next release - unit test "asynchronize" mode of XmemNozzle Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: feature-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Unresolved Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 16h
Time Spent: Not Specified
Original Estimate: 16h

Epic Link: XDCR next release




[MB-12181] XDCR@next release - rethink XmemNozzle's configuration parameters Created: 12/Sep/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: feature-backlog
Fix Version/s: None
Security Level: Public

Type: Task Priority: Major
Reporter: Xiaomei Zhang Assignee: Xiaomei Zhang
Resolution: Done Votes: 0
Labels: sprint1_xdcr
Remaining Estimate: 8h
Time Spent: Not Specified
Original Estimate: 8h

Epic Link: XDCR next release

 Description   
Rethink XmemNozzle's configuration parameters. Some of them should be construction-time parameters; others are runtime parameters.


 Comments   
Comment by Xiaomei Zhang [ 12/Sep/14 ]
https://github.com/Xiaomei-Zhang/couchbase_goxdcr_impl/commit/44921e06e141f0c9df9cfc4ab43d106643e9b766
https://github.com/Xiaomei-Zhang/couchbase_goxdcr_impl/commit/80a8a059201b9a61bbd1784abef96859670ac233




[MB-12180] Modularize the DCP code Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: techdebt-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
We need to modularize the DCP code so that we can write unit tests, ensuring that we have fewer bugs and fewer regressions from future changes.




[MB-12179] Allow incremental pausable backfills Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Task Priority: Major
Reporter: Mike Wiederhold Assignee: Mike Wiederhold
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
Currently ep-engine requires that backfills run from start to end and cannot be paused. This creates a problem for a few reasons. First, if a user has a large dataset then we will potentially need to backfill a large amount of data from disk into memory. Without the ability to pause and resume a backfill, we cannot control the memory overhead created by reading items off of disk. This can affect the resident ratio if the data that needs to be read by the backfill is large.

A second issue is that this means we can only run one backfill at a time (or two, if there are enough CPU cores), and all backfills must run serially. In the future we plan to allow more DCP connections to be created to a server. If many connections require backfill, some connections may not receive data for an extended period of time because they are waiting for their backfills to be scheduled.




[MB-12178] Fix race condition in checkpoint persistence command Created: 12/Sep/14  Updated: 12/Sep/14  Resolved: 12/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.1
Fix Version/s: 2.5.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Mike Wiederhold Assignee: Gokul Krishnan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Thread 11 (Thread 0x43fcd940 (LWP 6218)):

#0 0x00000032e620d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000032e6208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000032e6208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00002aaaaaf345ca in Mutex::acquire (this=0x1e79ac48) at src/mutex.cc:79
#4 0x00002aaaaaf595bb in lock (this=0x1e79a880, chkid=7, cookie=0x1d396580) at src/locks.hh:48
#5 LockHolder (this=0x1e79a880, chkid=7, cookie=0x1d396580) at src/locks.hh:26
#6 VBucket::addHighPriorityVBEntry (this=0x1e79a880, chkid=7, cookie=0x1d396580) at src/vbucket.cc:234
#7 0x00002aaaaaf1b580 in EventuallyPersistentEngine::handleCheckpointCmds (this=0x1d494a00, cookie=0x1d396580, req=<value optimized out>,
    response=0x40a390 <binary_response_handler>) at src/ep_engine.cc:3795
#8 0x00002aaaaaf20228 in processUnknownCommand (h=0x1d494a00, cookie=0x1d396580, request=0x1d3d6800, response=0x40a390 <binary_response_handler>) at src/ep_engine.cc:949
#9 0x00002aaaaaf2117c in EvpUnknownCommand (handle=<value optimized out>, cookie=0x1d396580, request=0x1d3d6800, response=0x40a390 <binary_response_handler>)
    at src/ep_engine.cc:1050
#10 0x00002aaaaacc4de4 in bucket_unknown_command (handle=<value optimized out>, cookie=0x1d396580, request=0x1d3d6800, response=0x40a390 <binary_response_handler>)
    at bucket_engine.c:2499
#11 0x00000000004122f7 in process_bin_unknown_packet (c=0x1d396580) at daemon/memcached.c:2911
#12 process_bin_packet (c=0x1d396580) at daemon/memcached.c:3238
#13 complete_nread_binary (c=0x1d396580) at daemon/memcached.c:3805
#14 complete_nread (c=0x1d396580) at daemon/memcached.c:3887
#15 conn_nread (c=0x1d396580) at daemon/memcached.c:5744
#16 0x0000000000406355 in event_handler (fd=<value optimized out>, which=<value optimized out>, arg=0x1d396580) at daemon/memcached.c:6012
#17 0x00002b52b162df3c in event_process_active_single_queue (base=0x1d46ec80, flags=<value optimized out>) at event.c:1308
#18 event_process_active (base=0x1d46ec80, flags=<value optimized out>) at event.c:1375
#19 event_base_loop (base=0x1d46ec80, flags=<value optimized out>) at event.c:1572
#20 0x0000000000414e34 in worker_libevent (arg=<value optimized out>) at daemon/thread.c:301
#21 0x00000032e620673d in start_thread () from /lib64/libpthread.so.0
#22 0x00000032e56d44bd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x4a3d7940 (LWP 6377)):

#0 0x00000032e620d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000032e6208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000032e6208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000415e16 in notify_io_complete (cookie=<value optimized out>, status=ENGINE_SUCCESS) at daemon/thread.c:485
#4 0x00002aaaaaf5a857 in notifyIOComplete (this=0x1e79a880, e=..., chkid=7) at src/ep_engine.h:423
#5 VBucket::notifyCheckpointPersisted (this=0x1e79a880, e=..., chkid=7) at src/vbucket.cc:250
#6 0x00002aaaaaf038fd in EventuallyPersistentStore::flushVBucket (this=0x1d77e000, vbid=109) at src/ep.cc:2033
#7 0x00002aaaaaf2c9e9 in doFlush (this=0x18c70dc0, tid=1046) at src/flusher.cc:222
#8 Flusher::step (this=0x18c70dc0, tid=1046) at src/flusher.cc:152
#9 0x00002aaaaaf36e74 in ExecutorThread::run (this=0x1d4c28c0) at src/scheduler.cc:159
#10 0x00002aaaaaf3746d in launch_executor_thread (arg=<value optimized out>) at src/scheduler.cc:36
#11 0x00000032e620673d in start_thread () from /lib64/libpthread.so.0
#12 0x00000032e56d44bd in clone () from /lib64/libc.so.6


 Comments   
Comment by Mike Wiederhold [ 12/Sep/14 ]
http://review.couchbase.org/#/c/41363/
Comment by Gokul Krishnan [ 12/Sep/14 ]
Thanks Mike!




[MB-12177] document SDK usage of CA and self-signed certs Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Major
Reporter: Matt Ingenthron Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt: finish-start
has to be done after MB-12173 SSL certificate should allow importin... Open

 Description   
To be done after Couchbase Server supports this.




[MB-12176] Missing port number on the network ports documentation for 3.0 Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: 3.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown




[MB-12175] Need a way to enforce SSL for admin and data access Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: feature-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Don Pinto
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Today we allow both unencrypted and encrypted communication, and one can use firewalls to control which of the two remains available for communicating with Couchbase Server. It would be great to have a switch that enforces secure communication and disables any unencrypted access, to make compliance with security standards easier.




[MB-12174] Clarification on SSL communication documentation for 3.0 Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Cihan Biyikoglu Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown




[MB-12173] SSL certificate should allow importing certs besides server generated certs Created: 12/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: None
Affects Version/s: 3.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Cihan Biyikoglu Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt: finish-start
has to be done before MB-12177 document SDK usage of CA and self-sig... Open
Triage: Untriaged
Is this a Regression?: Unknown

 Comments   
Comment by Matt Ingenthron [ 12/Sep/14 ]
Existing SDKs should be compatible with this, but importing the CA certs will need to be documented.




[MB-12172] UI displays duplicate warnings after graceful failover when >1 replica configured Created: 11/Sep/14  Updated: 11/Sep/14  Resolved: 11/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
I set up a bucket with 3 replicas (4 nodes) and performed a graceful failover on one node. The "server nodes" screen now displays both:
Fail Over Warning: Additional active servers required to provide the desired number of replicas!
and
Fail Over Warning: Rebalance recommended, some data does not have the desired replicas configuration!


Seems a bit duplicative and also not the same behavior you see after graceful failover with only one replica configured.

 Comments   
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Those are not exact duplicates. One is saying "in order to take advantage of 2 replicas you need 3 nodes at least". And the second is saying "some of your configured replicas are missing and it will be fixed by rebalance".
Comment by Perry Krug [ 11/Sep/14 ]
Okay, I'll let that slide ;)




[MB-12171] Typo: missing space in point 4 of Couchbase data files Created: 11/Sep/14  Updated: 11/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 1.8.0, 2.0.1, 2.1.0, 2.2.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Patrick Varley Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: documentation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-data-files
http://docs.couchbase.com/couchbase-manual-2.1/#couchbase-data-files
http://docs.couchbase.com/couchbase-manual-2.0/#couchbase-data-files
http://docs.couchbase.com/couchbase-manual-1.8/#couchbase-data-files

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Point 4 needs a space between "and" and "monitor":

Start the service again andmonitor the “warmup” of the data.

 Comments   
Comment by Ruth Harris [ 11/Sep/14 ]
Fixed in 2.5. N/A in 3.0




[MB-12170] Memory usage did not go down after flush Created: 10/Sep/14  Updated: 12/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.5.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Wayne Siu Assignee: Gokul Krishnan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Microsoft Windows Server 2008 R2 Enterprise
[info] OS Version : 6.1.7601 Service Pack 1 Build 7601
[info] HW Platform : PowerEdge M420
[info] CB Version : 2.5.0-1059-rel-enterprise
[info] CB Uptime : 31 days, 10 hours, 3 minutes, 51 seconds
[info] Architecture : x64-based PC
[ok] Installed CPUs : 16
[ok] Installed RAM : 98259 MB
[warn] Server Quota : 81.42% of total RAM. Max recommended is 80.00%
        (Quota: 80000 MB, Total RAM: 98259 MB)
[ok] Erlang VM vsize : 546 MB
[ok] Memcached vsize : 142 MB
[ok] Swap used : 0.00%
[info] Erlang VM scheduler : swt low is not set

Issue Links:
Relates to
relates to MB-9992 Memory is not released after 'flush' Closed
Triage: Untriaged
Operating System: Windows 64-bit
Is this a Regression?: Unknown

 Description   
The original problem was reported by our customer.

Steps to reproduce in their setup:
- Set up a 4-node cluster (probably does not matter) with a 3GB bucket and replication of 1.

- The program writes 10MB binary objects from 3 threads in parallel, 50 items in each thread.
Run the program (sometimes it crashes, I do not know why); simply run it again.
At the end of the run, there is a difference of 500 MB between ep_kv_size and the sum of vb_active_itm_memory and vb_replica_itm_memory (this may depend heavily on network speed; I am using just a 100Mbit connection to the server, while production has a faster network of course).
- Do the flush; ep_kv_size retains the size of the difference even though the bucket is empty.
- Repeat this. On each run, the resident items percentage will go down.
- On the fourth or fifth run, it throws a hard memory error after inserting only a part of the 150 items.
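
For anyone re-running this, a minimal sketch of how the reported gap could be measured after each flush (assuming testrunner's lib/mc_bin_client.py is importable, the bucket needs no authentication, and that the node exposes the stat names used below):

from mc_bin_client import MemcachedClient

def kv_size_gap(host, port=11210):
    # stats() issues the binary-protocol STAT command and returns a dict of strings
    stats = MemcachedClient(host=host, port=port).stats()
    kv_size = int(stats['ep_kv_size'])
    item_mem = int(stats['vb_active_itm_memory']) + int(stats['vb_replica_itm_memory'])
    return kv_size - item_mem

# hypothetical node address; the gap is expected to shrink back towards 0 after a flush
print("gap after flush: %d bytes" % kv_size_gap('10.1.2.3'))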




 Comments   
Comment by Wayne Siu [ 10/Sep/14 ]
Raju,
Can you please assign?
Comment by Raju Suravarjjala [ 10/Sep/14 ]
Tony, can you see if you can reproduce this bug? Please note it is 2.5.1 Windows 64bit
Comment by Anil Kumar [ 10/Sep/14 ]
Just an FYI: we previously opened a similar issue on CentOS, but it was resolved as cannot reproduce.
Comment by Ian McCloy [ 11/Sep/14 ]
It's 2.5.0 not 2.5.1 on Windows 2008 64bit
Comment by Thuan Nguyen [ 11/Sep/14 ]
Following the instructions from here:
Steps to reproduce in their setup:
- Set up a 4-node cluster (probably does not matter) with a 3GB bucket and replication of 1.

- The program writes 10MB binary objects from 3 threads in parallel, 50 items in each thread.
Run the program (sometimes it crashes, I do not know why); simply run it again.
At the end of the run, there is a difference of 500 MB between ep_kv_size and the sum of vb_active_itm_memory and vb_replica_itm_memory (this may depend heavily on network speed; I am using just a 100Mbit connection to the server, while production has a faster network of course).
- Do the flush; ep_kv_size retains the size of the difference even though the bucket is empty.
- Repeat this. On each run, the resident items percentage will go down.
- On the fourth or fifth run, it throws a hard memory error after inserting only a part of the 150 items.


I could not reproduce this bug after 6 flushes.
After each flush, mem use on both active and replica went down to zero.
Comment by Thuan Nguyen [ 11/Sep/14 ]
Using our loader, I could not reproduce this bug. I will use the customer's loader to test again.
Comment by Raju Suravarjjala [ 12/Sep/14 ]
Gokul: As we discussed can you folks try to reproduce this bug?




[MB-12169] Unexpected disk creates during graceful failover Created: 10/Sep/14  Updated: 10/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Perry Krug Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
4-node cluster with the beer-sample bucket plus 300k items. Workload is 50/50 gets/sets, but the sets constantly overwrite the same 300k items.

When I do a graceful failover of one node, I see a fair amount of disk creates even though no new data is being inserted.

If there is a reasonable explanation, great, but I am concerned that there may be something incorrect going on either with the identification of new data or the movement of vbuckets.

Logs are here:
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-193-230-57.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-215-23-198.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-215-29-139.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/perry/diskcreates/collectinfo-2014-09-10T205907-ns_1%40ec2-54-215-40-174.us-west-1.compute.amazonaws.com.zip




[MB-12168] Documentation: Clarification around server RAM quota best practice Created: 10/Sep/14  Updated: 10/Sep/14

Status: Open
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.5.1
Fix Version/s: None
Security Level: Public

Type: Improvement Priority: Minor
Reporter: Brian Shumate Assignee: Ruth Harris
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
The sizing[1] and RAM quota[2] documentation should be clearer about the specific best practice for the general server RAM quota: no greater than 80% of physical RAM per node on nodes with 16GB or more, and no greater than 60% on nodes with less than 16GB.

Emphasizing that the remaining 20% or 40% of RAM is required for the operating system, file system caches, and so on would be helpful as well.
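
For illustration only (the thresholds are the ones stated above; the function name is hypothetical), the recommendation amounts to:

def recommended_server_quota_mb(physical_ram_mb):
    # 80% of physical RAM on nodes with 16GB or more, 60% on smaller nodes;
    # the remainder is left for the operating system, file system caches, and so on
    ratio = 0.80 if physical_ram_mb >= 16 * 1024 else 0.60
    return int(physical_ram_mb * ratio)

print("%d MB" % recommended_server_quota_mb(65536))  # 52428 MB on a 64GB node
print("%d MB" % recommended_server_quota_mb(8192))   # 4915 MB on an 8GB node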

Additionally, the RAM quota sub-section of the Memory quota section[3] reads as if it is abruptly cut off or otherwise incomplete:

--------
RAM quota

You will not be able to allocate all your machine RAM to the per_node_ram_quota as there may be other programs running on your machine.
--------

1. http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#couchbase-bestpractice-sizing
2. http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#ram-quotas
3. http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#memory-quota






[MB-12167] Remove Minor / Major / Page faults graphs from the UI Created: 10/Sep/14  Updated: 16/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 2.5.1, 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Trivial
Reporter: Ian McCloy Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 1
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Customers often ask what is wrong with their system when they see anything greater than 0 page faults in the UI graphs. What are customers supposed to do with that information? This isn't a useful metric for customers and we shouldn't show it in the UI. If needed for development debugging we can query it from the REST API.

 Comments   
Comment by Matt Ingenthron [ 10/Sep/14 ]
Just to opine: +1. There are a number of things in the UI that aren't actionable. I know they help us when we look back over time, but as presented it's not useful.
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
So it's essentially an expression of our belief that the majority of our customers are ignorant enough to be confused by the word "fault" in the name of this stat?

Just want to make sure that there's no misunderstanding on this.

On Matt's point I'd like to say that all of our stats are not actionable. They're just information that might end up helpful occasionally. And yes, major page faults especially are a _tremendously_ helpful sign of issues.
Comment by Matt Ingenthron [ 10/Sep/14 ]
I don't think the word "fault" is at issue, but maybe others do. I know there are others that aren't actionable and to be honest, I take issue with them too. This one is just one of the more egregious examples. :) The problem is, in my opinion, it's not clear what one would do with minor page fault data. One can't really know what's good or bad without looking at trends or doing further analysis.

While I'm tossing out opinions, similarly visualizing everything as a queue length isn't always good. To the app, latency and throughput matter-- how many queues and where they are affects this, but doesn't define it. A big queue length with fast storage can still have very good latency/throughput and equally a short queue length with slow or variable (i.e., EC2 EBS) storage can have poor latency/throughput. An app that will slow down with higher latencies won't make the queue length any bigger.

Anyway, pardon the wide opinion here-- I know you know all of this and I look forward to improvements when we get to them.

You raise a good point on major faults though.

If it only helps occasionally, then it's consistent with the request (to remove it from the UI, but still have it in there). I'm merely a user here, so please discount my opinion accordingly!
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
>> If it only helps occasionally, then it's consistent with the request (to remove it from the UI, but still have it in there).

Well, but then it's true for almost all of our stats, isn't it? Doesn't that mean we need to hide them all?
Comment by Matt Ingenthron [ 10/Sep/14 ]
>> Well, but then it's true for almost all of our stats, isn't it? Doesn't that mean we need to hide them all?

I don't think so. That's an extreme argument. I'd put ops/s which is directly proportional to application load and minor faults which is affected by other things on the system in very different categories. Do we account for minor faults at a per-bucket level? ;)
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
>> I'd put ops/s which is directly proportional to application load and minor faults which is affected by other things on the system in very different categories.

True.

>> Do we account for minor faults at a per-bucket level? ;)

No. And good point. Indeed, lacking a better UI, we show all system stats (including some high-usefulness things like the count of memcached connections) as part of showing any bucket's stats, despite gathering and storing system stats separately.

In any case, I'm not totally against hiding page fault stats. It's indeed a minor topic.

But I'd like to see a good reason for that. Because for _anything_ that we do there will always be at least one user who's confused, which isn't IMO a valid reason for "let's hide it".

My team spent some effort getting these stats, and we did so specifically because we knew that major page faults are important to be aware of. And we also know that on Linux even minor page faults might be "major" in terms of latency impact. We've seen it with our own eyes.

I.e. when you're running out of free pages, one might think that Linux is just supposed to grab one of the clean pages from the page cache, but we've seen this take seconds for reasons I'm not quite sure of. It does look like Linux might routinely delay a minor page fault for IO (perhaps due to some locking impacts). And things like huge-page "minor" page faults may have an even more obviously hard effect (i.e. because you need a physically contiguous run of memory, and getting that might require "memory compaction", locking, etc). Our system, doing constant non-direct-io writes, routinely hits this hard condition, because nearly every write from ep-engine or the view engine has to allocate brand new page(s) for that data due to the append-onlyness of our design (forestdb's direct-io path plus custom buffer cache management should help dramatically here).

Comment by Patrick Varley [ 10/Sep/14 ]
I think there are three main consumers of stats:

* Customers (cmd_get)
* Support (ep_bg_load_avg)
* Developers of the component (erlang memory atom_used)

As a result we display and collect these stats in different ways, i.e. the UI, cbstats, ns_doctor, etc.

A number of our users find the number of stats in the UI overwhelming; a lot of the time they do not know which ones are important.

Some of our users do not even understand what a virtual memory system is, let alone what a page fault is.

I do not think we should display the page faults in the UI, but we should still collect them. I believe we can make better use of the space in the UI, for example: network usage, bytes written or bytes read, TCP retransmissions, or disk performance.
Comment by David Haikney [ 11/Sep/14 ]
+1 for removing page faults. The justification:
* We put them front and centre of the UI. Customers see Minor faults, Major Faults and Total faults before # gets, # sets.
* They have not proven useful for support in diagnosing an issue. In fact they cause more "false positive" questions ("my minor faults look high, is that a problem?")
* Overall this constitutes "noise" that our customers can do without. The stats can quite readily be captured elsewhere if we want to record them.

It would be easy to expand this into a wider discussion of how we might like to reorder / expand all of the current graphs in the UI - and that's a useful discussion. But I propose we keep this ticket to the question of removing the page fault stats.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41333
Comment by Ian McCloy [ 16/Sep/14 ]
Which version of Couchbase Server is this fixed in?




[MB-12166] Linux: Warnings on install are poorly formatted and unlikely to be read by a user. Created: 10/Sep/14  Updated: 10/Sep/14

Status: Open
Project: Couchbase Server
Component/s: installer
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Dave Rigby Assignee: Bin Cui
Resolution: Unresolved Votes: 0
Labels: supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Centos 6

Attachments: PNG File Screen Shot 2014-09-10 at 15.21.55.png    
Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Unknown

 Description   
When installing the 3.0 RPM, we check for various OS settings and print warnings if they don't meet our recommendations.

This is a great idea in principle, but the actual output isn't very well presented, meaning users are (IMHO) likely to not spot the issues which are being raised.

I've attached a screenshot to show this exactly as displayed in the console, but the verbatim text is:

---cut ---
$ sudo rpm -Uvh couchbase-server-enterprise_centos6_x86_64_3.0.0-1209-rel.rpm
Preparing... ########################################### [100%]
Warning: Transparent hugepages may be used. To disable the usage
of transparent hugepages, set the kernel settings at runtime with
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Warning: Transparent hugepages may be used. To disable the usage
of transparent hugepages, set the kernel settings at runtime with
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Warning: Swappiness is not 0.
You can set the swappiness at runtime with
sysctl vm.swappiness=0
Minimum RAM required : 4 GB
System RAM configured : 0.97 GB

Minimum number of processors required : 4 cores
Number of processors on the system : 1 cores

   1:couchbase-server ########################################### [100%]
Starting couchbase-server[ OK ]

You have successfully installed Couchbase Server.
Please browse to http://localhost.localdomain:8091/ to configure your server.
Please refer to http://couchbase.com for additional resources.

Please note that you have to update your firewall configuration to
allow connections to the following ports: 11211, 11210, 11209, 4369,
8091, 8092, 18091, 18092, 11214, 11215 and from 21100 to 21299.

By using this software you agree to the End User License Agreement.
See /opt/couchbase/LICENSE.txt.
$
---cut ---

A couple of observations:

1) Everything is run together, including informational things (Preparing, Installation successful) and things the user should act on (Warning: Swappiness, THP, firewall information).

2) It's not very clear how serious some of these messages are - is the fact I'm running with 1/4 of the minimum RAM just a minor thing, or a showstopper? Similarly with THP - Support have seen on many occasions that it can cause false-positive fail overs, but we just casually say here:

"Warning: Transparent hugepages may be used. To disable the usage of transparent hugepages, set the kernel settings at runtime with echo never > /sys/kernel/mm/transparent_hugepage/enabled"


Suggestions:

1) Make the Warnings more pronounced - e.g prefix with "[WARNING]" and add some blank lines between things

2) Make clearer why these things are listed - linking back to more detailed information in our install guide if necessary. For example: "THP may cause slowdown of the cluster manager and false-positive fail overs. Couchbase recommends disabling it. See http://docs.couchbase.com/THP for more details."

3) For things like THP which we can actually fix, ask the user if they want them fixed - after all, we are already root if we are installing - e.g. "THP bad.... Would you like the system THP setting to be changed to the recommended value (madvise)? (y/n)"

4) For things we can't fix (low memory, low CPUs) make the user confirm their decision to continue - e.g. "CPUs below minimum. Couchbase recommends at least XXX for production systems. Please type "test system" to continue installation."
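
A rough sketch of what suggestions 1, 3 and 4 could look like (illustrative Python only; the real install scripts are not Python, and every name here is hypothetical):

import sys

def warn(message):
    # Suggestion 1: prefix warnings and surround them with blank lines so they stand out
    sys.stderr.write("\n[WARNING] %s\n\n" % message)

def confirm(prompt, expected):
    # Suggestions 3/4: require an explicit answer before continuing
    sys.stdout.write("%s " % prompt)
    return sys.stdin.readline().strip().lower() == expected

warn("Transparent hugepages may be used. THP can cause false-positive fail overs; "
     "see the install guide for how to disable it.")
if not confirm("System RAM is below the 4 GB minimum. Type 'test system' to continue:",
               "test system"):
    sys.exit(1)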



 Comments   
Comment by David Haikney [ 10/Sep/14 ]
+1 from me - we can clearly improve the presentation here. I expect making the install interactive ("should I fix THP?") could be difficult. Are there existing precedents we can refer to here to help consistency?
Comment by Dave Rigby [ 10/Sep/14 ]
@DaveH: Admittedly I don't think they use RPM, but VMware guest tools springs to mind - they present the user a number of questions when installing - "do you want to automatically update kernel modules?", "do you want to use printer sharing", etc.

Admittedly they don't have a secondary config stage unlike us with our GUI, *but* if we are going to fix things like THP, swappiness, then we need to be root to do so (and so install-time is the only option).




[MB-12165] UI: Log - Collect Information. Upload options text boxes should be 'grayed out' when "Upload to couchbase" is not selected. Created: 10/Sep/14  Updated: 10/Sep/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Jim Walker Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: log, ui
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Centos 6 CB server (1 node cluster, VirtualBox VM)
Client browsers all running on OSX 10.9.4

Triage: Untriaged
Operating System: MacOSX 64-bit
Is this a Regression?: Unknown

 Description   
Couchbase Server Version: 3.0.0 Enterprise Edition (build-1208)

When going to the log upload area of the UI, I found that all text boxes in the Upload Options section are read-only without any visual indicator.

It took a bit of clicking and checking browser liveness to realize that this was because the "Upload to couchbase" check box was not checked.

The input boxes should be grayed out, or there should be some other visual indicator showing they're not usable.

* Tested with Chrome 37.0.2062.120
* Tested with Safari 7.0.6 (9537.78.2)




[MB-12164] UI: Cancelling a pending add should not show "reducing capacity" dialog Created: 10/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Improvement Priority: Trivial
Reporter: David Haikney Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
3.0.0 Beta build 2

Steps to reproduce:
In the UI click "Server add".
Add the credentials for a server to be added
In the Pending Rebalance pane click "Cancel"

Actual Behaviour:
See a dialog stating "Warning – Removing this server from the cluster will reduce cache capacity across all data buckets. Are you sure you want to remove this server?"

Expected behaviour:
The dialog is not applicable in this context, since cancelling the add of a node that has not yet joined will do nothing to the cluster capacity. Would expect either no dialog, or a dialog acknowledging that "This node will no longer be added to the cluster on next rebalance".

 Comments   
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
But it _is_ applicable, because you're returning the node to the "pending remove" state.
Comment by David Haikney [ 10/Sep/14 ]
A node that has never held any data or actively participated in the cluster cannot possibly reduce the cluster's capacity.
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
It looks like I misunderstood this request as referring to cancelling add-back after failover. Which it isn't.

Makes sense now.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41428




[MB-12163] Memcached Closing connection due to read error: Unknown error Created: 10/Sep/14  Updated: 10/Sep/14

Status: Open
Project: Couchbase Server
Component/s: memcached
Affects Version/s: 2.5.0, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Minor
Reporter: Ian McCloy Assignee: Dave Rigby
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: [info] OS Name : Microsoft Windows Server 2008 R2 Enterprise
[info] OS Version : 6.1.7601 Service Pack 1 Build 7601
[info] CB Version : 2.5.0-1059-rel-enterprise

Issue Links:
Dependency
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
The error message "Closing connection due to read error: Unknown error" doesn't explain what the problem is. Unfortunately on Windows we aren't parsing the error code properly. We need to call FormatMessage() not strerror().

Code At
http://src.couchbase.org/source/xref/2.5.0/memcached/daemon/memcached.c#5360
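
For illustration (Python on Windows; the actual fix belongs in memcached's C code), the difference the description refers to can be seen with:

import ctypes, os

wsa_err = 10054  # WSAECONNRESET, a typical socket read error code on Windows

# strerror() only knows the C runtime's errno range, so Winsock codes come out as
# something like "Unknown error":
print(os.strerror(wsa_err))

# FormatMessage(), exposed in Python via ctypes.FormatError(), knows system error codes
# and returns e.g. "An existing connection was forcibly closed by the remote host.":
print(ctypes.FormatError(wsa_err))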




[MB-12162] Performance Test for Rebalance after failover fails Created: 09/Sep/14  Updated: 10/Sep/14  Resolved: 10/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: performance
Affects Version/s: 3.0
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Thomas Anderson Assignee: Thomas Anderson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4 node cluster, each node 16 core, 64G memory

Triage: Triaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump: ci.sc.couchbase.com:/tmp/Reb_Failover_100M_DGM_Views
{172.23.100.29.zip, 172.23.100.30.zip, 172.23.100.31.zip, 172.23.100.32.zip)
additionally in same folder
{node1_memory_usage, node2...}
Is this a Regression?: Unknown

 Description   
Rebalance after failover fails to complete. After ~4 hrs of processing it hangs and makes no progress.
Eventually beam.smp will declare itself out of memory (the system log shows that beam.smp invoked the OOM killer). Memory allocated at the time of failure is 62G of 64G; memcached has allocated ~40G.
Other characteristics: 100M documents, DGM and 4 views.

This test passed on 3.0.0-1205 and earlier; it fails on 3.0.0-1208.
A similar test with only 20M documents, no views and no DGM passed just prior.

 Comments   
Comment by Thomas Anderson [ 10/Sep/14 ]
Problem research shows that the memory utilization issue is a known issue with memcached: under high load, memcached can consume memory to the point where other processes fail with 'no memory' conditions.
The working solution is to not allow memcached unlimited memory, using the startup parameter -m. If memcached is limited to 50% of available memory, the O/S and Couchbase processes do not exhibit memory pressure issues.
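
A minimal sketch of deriving such a cap (assumptions: Linux, and the 50% figure from the comment above):

import os

def half_ram_mb():
    # physical RAM in bytes, halved, expressed in MB for memcached's -m flag
    total_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
    return int(total_bytes / 2 / (1024 * 1024))

print("memcached -m %d" % half_ram_mb())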
Comment by Thomas Anderson [ 10/Sep/14 ]
See the qualifying description tying the problem to the memcached memory issue.




[MB-12161] per-server UI does not refresh properly when adding a node Created: 09/Sep/14  Updated: 09/Sep/14

Status: Open
Project: Couchbase Server
Component/s: UI
Affects Version/s: 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Minor
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Triage: Untriaged
Is this a Regression?: Unknown

 Description   
Admittedly quite minor, but a little annoying.

When you're looking at a single stat across all nodes of a cluster (e.g. active vbuckets):

-Add a new node to the cluster from another tab open to the UI
-Note that the currently open stats screen stops displaying graphs for the existing nodes and does not reflect that a new node has joined until you refresh the screen




[MB-12160] setWithMeta() is able to update a locked remote key Created: 09/Sep/14  Updated: 09/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aruna Piravi Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: all, 3.0.0-1208

Attachments: Zip Archive 10.3.4.186-992014-168-diag.zip     Zip Archive 10.3.4.188-992014-1611-diag.zip    
Triage: Untriaged
Is this a Regression?: No

 Description   
A simple test to check whether setWithMeta() refrains from updating a locked key:

Steps
--------
1. uni-xdcr on the default bucket from .186 --> .188
2. create a key 'pymc1098' with "old_doc" on .186
3. sleep for 10 secs; it gets replicated to .188.
4. Now getAndLock() 'pymc1098' on .188 for 20s
5. Meanwhile, update the same key at .186
6. After 10s (the lock should not have expired yet; also see timestamps in the test log below), do a getMeta() at source and dest: they match.
The destination key contains "new_doc".


def test_replication_after_getAndLock_dest(self):
        src = MemcachedClient(host=self.src_master.ip, port=11210)
        dest = MemcachedClient(host=self.dest_master.ip, port=11210)
        self.log.info("Initial set = key:pymc1098, value=\"old_doc\" ")
        src.set('pymc1098', 0, 0, "old_doc")
       # wait for doc to replicate
        self.sleep(10)
       # apply lock on destination
        self.log.info("getAndLock at destination for 20s ...")
        dest.getl('pymc1098', 20, 0)
       # update source doc
        self.log.info("Updating 'pymc1098' @ source with value \"new_doc\"...")
        src.set('pymc1098', 0, 0, "new_doc")
        self.sleep(10)
        self.log.info("getMeta @ src: {}".format(src.getMeta('pymc1098')))
        self.log.info("getMeta @ dest: {}".format(dest.getMeta('pymc1098')))
        src_doc = src.get('pymc1098')
        dest_doc = dest.get('pymc1098')


2014-09-09 15:27:13 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] Initial set = key:pymc1098, value="old_doc"
2014-09-09 15:27:13 | INFO | MainProcess | test_thread | [xdcrbasetests.sleep] sleep for 10 secs for doc to be replicated ...
2014-09-09 15:27:23 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] getAndLock at destination for 20s ...
2014-09-09 15:27:23 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] Updating 'pymc1098' @ source with value "new_doc"...
2014-09-09 15:27:23 | INFO | MainProcess | test_thread | [xdcrbasetests.sleep] sleep for 10 secs. ...
2014-09-09 15:27:33 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] getMeta @ src: (0, 0, 0, 2, 16849348715855509)
2014-09-09 15:27:33 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] getMeta @ dest: (0, 0, 0, 2, 16849348715855509)
2014-09-09 15:27:33 | INFO | MainProcess | test_thread | [uniXDCR.test_replication_after_getAndLock_dest] src_doc = (0, 16849348715855509, 'new_doc')
dest_doc =(0, 16849348715855509, 'new_doc')

Will attach cbcollect.

 Comments   
Comment by Aruna Piravi [ 09/Sep/14 ]
This causes inconsistency: the server by itself disallows a set on a locked key, but allows the set to go through via setWithMeta.




[MB-12159] Memcached throws an irrelevant message while trying to update a locked key Created: 09/Sep/14  Updated: 09/Sep/14

Status: Open
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Aruna Piravi Assignee: Chiyoung Seo
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 3.0.0-1208

Triage: Untriaged
Is this a Regression?: No

 Description   
A simple test to see if updates are possible on locked keys

def test_lock(self):
        src = MemcachedClient(host=self.src_master.ip, port=11210)
        # first set
        src.set('pymc1098', 0, 0, "old_doc")
        # apply lock
        src.getl('pymc1098', 30, 0)
        # update key
        src.set('pymc1098', 0, 0, "new_doc")

throws the following Memcached error -

  File "pytests/xdcr/uniXDCR.py", line 784, in test_lock
    src.set('pymc1098', 0, 0, "new_doc")
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 163, in set
    return self._mutate(memcacheConstants.CMD_SET, key, exp, flags, 0, val)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 132, in _mutate
    cas)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 128, in _doCmd
    return self._handleSingleResponse(opaque)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 121, in _handleSingleResponse
    cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
  File "/Users/apiravi/Documents/testrunner/lib/mc_bin_client.py", line 117, in _handleKeyedResponse
    raise MemcachedError(errcode, rv)
MemcachedError: Memcached error #2 'Exists': Data exists for key for vbucket :0 to mc 10.3.4.186:11210






[MB-12158] erlang gets stuck in gen_tcp:send despite socket being closed (was: Replication queue grows unbounded after graceful failover) Created: 09/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 3.0-Beta
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Perry Krug Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dcp_proxy.beam    
Triage: Untriaged
Is this a Regression?: Unknown

 Description   
After speaking with Mike briefly, sounds like this may be a known issue. My apologies if there is a duplicate issue already filed.

Logs are here:
 https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-176-128-88.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-193-231-33.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-111-249.us-west-1.compute.amazonaws.com.zip
https://s3.amazonaws.com/customers.couchbase.com/perry/replicationqueuegrowth/collectinfo-2014-09-09T205123-ns_1%40ec2-54-219-84-241.us-west-1.compute.amazonaws.com.zip

 Comments   
Comment by Mike Wiederhold [ 10/Sep/14 ]
Perry,

The stats seem to be missing for dcp streams so I cannot look further into this. If you can still reproduce this on 3.0 build 1209 then assign it back to me and include the logs.
Comment by Perry Krug [ 11/Sep/14 ]
Mike, does the cbcollect_info include these stats or do you need me to gather something specifically when the problem occurs?

If not, let's also get them included for future builds...
Comment by Perry Krug [ 11/Sep/14 ]
Hey Mike, I'm having a hard time reproducing this on build 1209 where it seemed rather easy on previous builds. Do you think any of the changes from the "bad_replicas" bug would have affected this? Is it worth reproducing on a previous build where it was easier in order to get the right logs/stats or do you think it may be fixed already?
Comment by Mike Wiederhold [ 11/Sep/14 ]
This very well could be related to MB-12137. I'll take a look at the cluster and if I don't find anything worth investigating further then I think we should close this as cannot reproduce since it doesn't seem to happen anymore on build 1209. If there is still a problem I'm sure it will be reproduced again later in one of our performance tests.
Comment by Mike Wiederhold [ 11/Sep/14 ]
It looks like one of the dcp connections to the failed over node was still active. My guess is that the node went down and came back up quickly. As a result it's possible that ns_server re-established the connection with the downed node. Can you attach the logs and assign this to Alk so he can take a look?
Comment by Perry Krug [ 11/Sep/14 ]
Thanks Mike.

Alk, logs are attached from the first time this was reproduced. Let me know if you need me to do so again.

Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Mike, btw for the future, if you could post exact details (i.e. node and name of connection) of stuff you want me to double-check/explain it could have saved me time.

Also, let me note that it's replica and node master who establishes replication. I.e. we're "pulling" rather than "pushing" replication.

I'll look at all this and see if I can find something.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Sorry, replica instead of master, who initiates replication.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Indeed I'm seeing dcp connection from memcached on .33 to beam of .88. And it appears that something in dcp replicator is stuck. I'll need a bit more time to figure this out.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Looks like socket send gets blocked somehow despite socket actually being closed already.

Might be serious enough to be a show stopper for 3.0.

Do you by any chance still have nodes running? Or if not, can you easily reproduce this? Having direct access to bad node might be very handy to diagnose this further.
Comment by Aleksey Kondratenko [ 11/Sep/14 ]
Moved back to 3.0, because if it's indeed an erlang bug it might be very hard to fix, and because it may happen not just during failover.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage - need an update pls.
Comment by Perry Krug [ 12/Sep/14 ]
I'm reproducing now and will post both the logs and the live systems momentarily
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Able to reproduce this condition with erlang outside of our product (which is great news):

* connect gen_tcp socket to nc or irb process listening

* spawn erlang process that will send stuff infinitely on that socket and will eventually block

* from erlang console do gen_tcp:close (i.e. while other erlang process is blocked writing)

* observe how erlang process that's blocked is still blocked

* observe with lsof that socket isn't really closed

* close the socket on the other end (by killing nc)

* observe with lsof that socket is closed

* observe how erlang process is still blocked (!) despite underlying socket fully dead

The fact that it's not a race is really great because dealing with deterministic bug (even if it's "feature" from erlang's point of view) is much easier
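
For the "nc or irb process listening" end of the repro, any peer that accepts the connection and never reads will do; a minimal Python stand-in (the port number is arbitrary):

import socket, time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('0.0.0.0', 9999))
srv.listen(1)
conn, addr = srv.accept()
print("accepted connection from %s:%d" % addr)
# hold the connection open without ever reading, so the sender's buffers fill
# and the blocked gen_tcp:send on the Erlang side can be observed
time.sleep(3600)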
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Fix is at: http://review.couchbase.org/41396

I need approval to get this in 3.0.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Attaching fixed dcp_proxy.beam if somebody wants to be able to test the fix without waiting for build
Comment by Perry Krug [ 12/Sep/14 ]
Awesome as usual Alk, thanks very much.

I'll give this a try on my side for verification.
Comment by Parag Agarwal [ 12/Sep/14 ]
Alk, will this issue occur with TAP as well, during upgrades?
Comment by Mike Wiederhold [ 12/Sep/14 ]
Alk,

I apologize for not including a better description of what happened. In the future I'll make sure to leave better details before assigning bugs to others so that we don't have multiple people duplicating the same work.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> Alk, will this issue occur in TAP as well? during upgrades.

No.
Comment by Perry Krug [ 12/Sep/14 ]
As of yet unable to reproduce this on build 1209+dcp_proxy.beam.

Thanks for the quick turnaround Alk.
Comment by Cihan Biyikoglu [ 12/Sep/14 ]
triage discussion:
under load this may happen frequently -
there is a good chance that this recovers itself in a few minutes - it should, but we should validate.
if we are in this state, we can restart erlang to get out of the situation - no app unavailability required
the fix could be risky to take at this point

decision: not taking this for 3.0
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Mike, I need your ACK on this:

Because of the dcp nops between replicators, the dcp producer should, after a few minutes, close its side of the socket and release all resources.

Am I right? I said this in the meeting just a few minutes ago and it affected the decision. If I'm wrong (say, if you decided to disable nops in the end, or if you know it's broken, etc.), then we need to know it.
Comment by Perry Krug [ 12/Sep/14 ]
FWIW, I have seen that this does not recover after a few minutes. However, I agree that it is workaround-able either by restarting beam or by bringing the node back into the cluster. Unless we think this will happen much more often, I agree it could be deferred out of 3.0.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
Well, if it does not recover then it can be argued that we have another bug on the ep-engine side that may lead to similar badness (queue size and resources eaten) _without_ a clean workaround.

Mike, we'll need your input on DCP NOPs.
Comment by Mike Wiederhold [ 12/Sep/14 ]
I was curious about this myself. As far as I know the noop code is working properly and we have some tests to make sure it is. I can work with Perry to try to figure out what is going on on the ep-engine side and see if the noops are actually being sent. I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.
Comment by Aleksey Kondratenko [ 12/Sep/14 ]
>> I know this sounds unlikely, but I was curious whether or not the noops were making it through to the failed over node for some reason.

I can rule this out. We do have a connection between the destination's beam and the source's memcached, but we _don't_ have the beam's connection to the destination's memcached anymore. Erlang is stuck writing to a dead socket, so there's no way you could get nop acks back.
Comment by Perry Krug [ 15/Sep/14 ]
I've confirmed that this state persists for much longer than a few minutes...I've not ever seen it recover itself, and have left it to run for 15-20 minutes at least.

Do you need a live system to diagnose?
Comment by Cihan Biyikoglu [ 15/Sep/14 ]
thanks for the update - Mike, sounds like we should open an issue for DCP to reliably detect these conditions. We should add this in for 3.0.1.
Perry, could you confirm that restarting the erlang process resolves the issue?
thanks
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41410

Mike will open different ticket for NOPs in DCP.




[MB-12157] Intrareplication falls behind OPs causing data loss situation Created: 09/Sep/14  Updated: 09/Sep/14

Status: Open
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 3.0.1, 3.0, 3.0-Beta
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thomas Anderson Assignee: Thomas Anderson
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 4 node cluster; 4 core nodes; beer-sample application run at 60Kops (50/50 ratio), nodes provisioned on RightScale EC2 x1.large images

Triage: Untriaged
Operating System: Centos 64-bit
Is this a Regression?: Yes

 Description   
The intra-replication queue grows to unacceptable limits, exposing potential data loss of multiple seconds of queued replication.
The problem is more pronounced on the RightScale-provisioned cluster, but can be seen on local physical clusters with a long enough test run (>20 min). Recovery requires stopping the input request queue.
Initial measurements of the erlang process suggest that minor retries on scheduled network I/O eventually build up into a limit for the push of replication data. scheduler_wait appears to be the consuming element; the epoll_wait counter increases per measurement, as does the mean wait time, suggesting thrashing in the erlang event scheduler. There are various papers/presentations that suggest Erlang is sensitive to the balance of tasks (a mix of long and short events can cause throughput issues).

cbcollectinfo logs will be attached shortly

 Comments   
Comment by Aleksey Kondratenko [ 09/Sep/14 ]
Still don't have any evidence. Cannot own this ticket until evidence is provided.




[MB-12156] time of check/time of use race in data path change code of ns_server may lead to deletion of all buckets after adding node to cluster Created: 09/Sep/14  Updated: 15/Sep/14  Resolved: 15/Sep/14

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0, 1.8.1, 2.0, 2.1.0, 2.2.0, 2.1.1, 2.5.0, 2.5.1, 3.0
Fix Version/s: 3.0.1
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates to
Triage: Untriaged
Is this a Regression?: No

 Description   
SUBJ.

In the code that changes the data path, we first check whether the node is provisioned (without preventing its provisioned state from changing after that) and then proceed with the change of data path. As part of changing the data path we delete buckets.

So if the node gets added to a cluster after the check but before the data path is actually changed, we'll delete all buckets of the cluster.

As improbable as it may seem, it actually occurred in practice. See CBSE-1387.
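
A schematic sketch of the race and of the obvious fix (illustrative Python only; the real code is ns_server's Erlang, and all names here are hypothetical):

import threading

lock = threading.Lock()
node = {'provisioned': False, 'buckets': ['default', 'beer-sample']}

def change_data_path_racy():
    # time of check
    if not node['provisioned']:
        # ... the node may be added to a cluster right here ...
        # time of use: wipes buckets that now belong to the cluster
        node['buckets'] = []

def change_data_path_safe():
    # check and act under one guard so provisioned-ness cannot change in between
    with lock:
        if not node['provisioned']:
            node['buckets'] = []

def add_node_to_cluster():
    with lock:
        node['provisioned'] = True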


 Comments   
Comment by Aleksey Kondratenko [ 10/Sep/14 ]
Whether it's a must have for 3.0.0 is not for me to decide but here's my thinking.

* the bug was there at least since 2.0.0 and it really requires something outstanding in customer's environment to actually occur

* 3.0.1 is just couple months away

* 3.0.0 is done

But if we're still open to adding this fix to 3.0.0, my team will surely be glad to do it.
Comment by Aleksey Kondratenko [ 15/Sep/14 ]
http://review.couchbase.org/41332
http://review.couchbase.org/41333




Generated at Tue Sep 16 13:23:08 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.