[MB-6706] [system test] rebalance hang when add nodes to cluster Created: 20/Sep/12  Updated: 10/Jan/13  Resolved: 03/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Filipe Manana
Resolution: Fixed Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.2 64bit build 2.0.0-1746

Attachments: PNG File ss_2012-09-27_at_4.02.17 PM.png     PNG File ss_2012-10-02_at_11.33.33 AM.png    

 Description   
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- SSD disks are formatted as ext4 and mounted on /data
- Each server has its own drive; no disks are shared between servers.
- Load 15 million items into both buckets
- Cluster has 2 buckets, default (11GB) and saslbucket (11GB), with consistent views enabled. Together the 2 buckets use only 68% of the system's total RAM.
- Each bucket has one design doc with 2 views (d1 in default and d11 in saslbucket)
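
As a rough check of the 68% figure above, here is a minimal Python sketch. It assumes (this is an interpretation, not something the report states) that each 11 GB bucket quota applies per node and is compared against the 32 GB of RAM on each server. The names `BUCKET_QUOTAS_GB`, `SERVER_RAM_GB` and `ram_fraction` are made up for illustration.

```python
# Assumption: bucket quotas are per node and are compared to the
# physical RAM of a single server (32 GB), as listed in the report.
BUCKET_QUOTAS_GB = {"default": 11, "saslbucket": 11}
SERVER_RAM_GB = 32


def ram_fraction(quotas_gb, total_ram_gb):
    """Return the fraction of total RAM taken by the summed bucket quotas."""
    return sum(quotas_gb.values()) / total_ram_gb


if __name__ == "__main__":
    fraction = ram_fraction(BUCKET_QUOTAS_GB, SERVER_RAM_GB)
    print(f"bucket quotas use {fraction:.2%} of server RAM")
```

Under that assumption the quotas come to 22 GB of 32 GB, which is in line with the "68%" quoted in the description.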

* Create a cluster of 4 nodes running Couchbase Server 2.0.0-1746

10.6.2.37
10.6.2.38
10.6.2.39
10.6.2.40

* Data path /data
* View path /data

* Add 4 nodes to the cluster and rebalance
10.6.2.42
10.6.2.43
10.6.2.44
10.6.2.45

* Rebalance hangs

* Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1746-reb-hang-20120920.tgz

Link to atop file of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-8nodes-1746-reb-hang-20120920.tgz

 Comments   
Comment by Thuan Nguyen [ 21/Sep/12 ]
I think this bug is the same as MB-6707.
Comment by Aleksey Kondratenko [ 24/Sep/12 ]
Filipe, there are no crashes, and you can see in the logs of .38 that we're waiting for an index update (there's just 1 simple index) that never happens.


[ns_server:debug,2012-09-20T17:20:30.431,ns_1@10.6.2.38:<0.21339.32>:capi_set_view_manager:do_wait_index_updated:596]References to wait: [#Ref<0.0.309.194865>] ("saslbucket", 531)

I advise you to take a quick look at the source of do_wait_index_updated in capi_set_view_manager. Maybe you will spot something that I'm not doing right.
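
For anyone scanning the collect_info logs for similar stuck waits, here is a minimal Python sketch that pulls the node, bucket and trailing number out of a `do_wait_index_updated` line like the one above. The regular expression is inferred from this single log line only, and the field names are assumptions. In particular, reading 531 as a partition (vbucket) id is a guess, not something the log states.

```python
import re

# The log line quoted in the comment above, copied verbatim.
LOG_LINE = (
    '[ns_server:debug,2012-09-20T17:20:30.431,ns_1@10.6.2.38:<0.21339.32>:'
    'capi_set_view_manager:do_wait_index_updated:596]'
    'References to wait: [#Ref<0.0.309.194865>] ("saslbucket", 531)'
)

# Pattern inferred from that one line; other ns_server lines may differ.
PATTERN = re.compile(
    r'^\[(?P<app>[^:]+):(?P<level>[^,]+),(?P<timestamp>[^,]+),'
    r'(?P<node>[^:]+):<[^>]+>:'
    r'capi_set_view_manager:do_wait_index_updated:\d+\]\s*'
    r'References to wait: \[(?P<refs>.*)\] '
    r'\("(?P<bucket>[^"]+)", (?P<partition>\d+)\)$'
)


def parse_wait_line(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # "partition" is a hypothetical name for the trailing integer.
    fields["partition"] = int(fields["partition"])
    return fields


if __name__ == "__main__":
    print(parse_wait_line(LOG_LINE))
```

Grouping the parsed lines by node and bucket would show which waits never get a follow-up, but that step depends on log details not shown in this ticket.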
Comment by Thuan Nguyen [ 27/Sep/12 ]
Hit this bug again in build 2.0.0-1777 with a swap rebalance: add nodes 44 and 45, and remove nodes 39 and 40.
The rebalance hangs after moving some items to the newly added nodes.

Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- 24.8 GB RAM allocated to Couchbase Server on each node
- SSD disks are formatted as ext4 and mounted on /data
- Each server has its own SSD drive; no disks are shared between servers.
- Create a cluster of 6 nodes running Couchbase Server 2.0.0-1777 (nodes listed below)
- Cluster has 2 buckets, default (12GB) and saslbucket (12GB).
- Each bucket has one design doc with 2 views (d1 in default and d11 in saslbucket)


10.6.2.37
10.6.2.38
10.6.2.39
10.6.2.40
10.6.2.42
10.6.2.43

* Load 18 million items into both buckets. Each key has a size of 512 to 1024 bytes
* Query all 4 views from the 2 design docs


* Nodes added in the swap rebalance:
10.6.2.44
10.6.2.45

* Data path /data
* View path /data

Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1777-swap-reb-hang-20120927-155552.tgz

Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-8nodes-1777-swap-reb-hang-20120927-155750.tgz

Comment by Thuan Nguyen [ 02/Oct/12 ]
Hit this bug again in build 2.0.0-1781 in system test.

* Add 2 nodes (39 and 40) and rebalance. During the rebalance, reboot nodes 42 and 43. The rebalance failed as expected.
* After the nodes finished warmup, rebalance again. The rebalance failed with bug MB-6490 on node 44.
* Fail over node 44 and rebalance.
* The cluster rebalanced saslbucket first. That rebalance finished after about 17 hrs

Started rebalancing bucket saslbucket ns_rebalancer000 ns_1@10.6.2.37 14:44:27 - Mon Oct 1, 2012
Started rebalancing bucket default ns_rebalancer000 ns_1@10.6.2.37 08:14:08 - Tue Oct 2, 2012
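
The roughly 17 hours quoted above can be read off the two log entries. Here is a minimal Python sketch of that arithmetic, assuming both timestamps are in the same timezone and that the default bucket started rebalancing as soon as saslbucket finished. The `FMT` format string and the `elapsed` helper are illustrative, and `%a`/`%b` parsing assumes an English locale.

```python
from datetime import datetime

# Format of the timestamps in the two log entries above.
FMT = "%H:%M:%S - %a %b %d, %Y"


def elapsed(start, end):
    """Return the timedelta between two timestamps written in FMT."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Start of the saslbucket rebalance and start of the default rebalance.
delta = elapsed("14:44:27 - Mon Oct 1, 2012", "08:14:08 - Tue Oct 2, 2012")

if __name__ == "__main__":
    print(f"saslbucket rebalance took about {delta}")
```

The gap between the two entries is 17 hours, 29 minutes and 41 seconds.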

** Rebalance of the default bucket hung around 10:00 AM on Tue Oct 2, 2012, as shown in the attached screenshot


Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- 24.8 GB RAM allocated to Couchbase Server on each node
- SSD disks are formatted as ext4 and mounted on /data
- Each server has its own SSD drive; no disks are shared between servers.
- Create a cluster of 6 nodes running Couchbase Server 2.0.0-1781 (nodes listed below)
- Cluster has 2 buckets, default (12GB) and saslbucket (12GB).
- Each bucket has one design doc with 2 views (d1 in default and d11 in saslbucket)
- Consistent views are enabled on the cluster (the default)

10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43

* Load 14 million items into each bucket. Each key has a size of 512 to 1024 bytes
* Mutate 14 million items in each bucket, with each key sized from 1024 to 1500 bytes
* Load runs at about 8K to 10K ops on both buckets
* Query all 4 views from the 2 design docs


* Nodes added to the cluster:
10.6.2.39
10.6.2.40

* Data path /data
* View path /data

Manifest info from build 1781
http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1781-rel.rpm.manifest.xml

Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201210/8nodes-col-info-1781-rebalance-hang-20121002-114333.tgz

Link to tap stats from all nodes https://friendpaste.com/6JqjtMOwZLvmlx5h9fxt6L
Comment by Thuan Nguyen [ 02/Oct/12 ]
Promoted to blocker since we hit it often in system test.
Comment by Thuan Nguyen [ 02/Oct/12 ]
I killed the loads on the default bucket (the one currently rebalancing but hung). The rebalance resumed about five minutes after all loads had stopped. A few minutes later I restarted half of the loads on the default bucket, and the rebalance kept running.
Comment by Thuan Nguyen [ 03/Oct/12 ]
Integrated in github-couchdb-preview #509 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/509/])
    MB-6706 Trigger update after defining indexable partitions (Revision 9098ff069968247556da72a2be1bbfd944b1d30e)

     Result = SUCCESS
pwansch :
Files :
* src/couch_set_view/src/couch_set_view_group.erl
Generated at Mon Sep 01 18:45:11 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.