[MB-6706] [system test] rebalance hang when add nodes to cluster Created: 20/Sep/12 Updated: 10/Jan/13 Resolved: 03/Oct/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | ns_server |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Thuan Nguyen | Assignee: | Filipe Manana |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | system-test | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 6.2 64bit build 2.0.0-1746 | ||
| Attachments: |
|
| Description |
|
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- SSD disk formatted as ext4 on /data
- Each server has its own drive; no disk is shared with another server.
- Load 15 million items to both buckets
- Cluster has 2 buckets, default (11 GB) and saslbucket (11 GB), with consistent view enabled. The 2 buckets use only 68% of the system's total RAM.
- Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)

* Create a cluster of 4 nodes with Couchbase Server 2.0.0-1746 installed
  10.6.2.37
  10.6.2.38
  10.6.2.39
  10.6.2.40
* Data path /data
* View path /data
* Add 4 nodes to the cluster and rebalance
  10.6.2.42
  10.6.2.43
  10.6.2.44
  10.6.2.45
* Rebalance hangs
* Link to collect info of all nodes
  https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1746-reb-hang-20120920.tgz
  Link to atop file of all nodes
  https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-8nodes-1746-reb-hang-20120920.tgz |
| Comments |
| Comment by Thuan Nguyen [ 21/Sep/12 ] |
|
I think this bug is the same as bug |
| Comment by Aleksey Kondratenko [ 24/Sep/12 ] |
|
Filipe, there are no crashes, and you can see in the logs of .38 that we're waiting for an index update (there's just 1 simple index) that never happens.

[ns_server:debug,2012-09-20T17:20:30.431,ns_1@10.6.2.38:<0.21339.32>:capi_set_view_manager:do_wait_index_updated:596]References to wait: [#Ref<0.0.309.194865>] ("saslbucket", 531)

I advise you to take a quick look at the source of do_wait_index_updated in capi_set_view_manager. Maybe you will spot something that I'm not doing right. |
| Comment by Thuan Nguyen [ 27/Sep/12 ] |
|
Hit this bug again in build 2.0.0-1777 with swap rebalance.
Add nodes 44 and 45, and remove nodes 39 and 40.
Rebalance hangs after moving some items to the newly added nodes.

Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- 24.8 GB RAM for Couchbase Server at each node
- SSD disk formatted as ext4 on /data
- Each server has its own SSD drive; no disk is shared with another server.
- Create a cluster of 6 nodes with Couchbase Server 2.0.0-1777 installed
- Cluster has 2 buckets, default (12 GB) and saslbucket (12 GB).
- Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)
  10.6.2.37
  10.6.2.38
  10.6.2.39
  10.6.2.40
  10.6.2.42
  10.6.2.43

* Load 18 million items to both buckets. Each key has a size from 512 bytes to 1024 bytes
* Query all 4 views from the 2 docs
  10.6.2.44
  10.6.2.45
* Data path /data
* View path /data

Link to collect info of all nodes
https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1777-swap-reb-hang-20120927-155552.tgz
Link to atop of all nodes
https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-8nodes-1777-swap-reb-hang-20120927-155750.tgz |
| Comment by Thuan Nguyen [ 02/Oct/12 ] |
|
Hit this bug again in build 2.0.0-1781 in system test.
* Add 2 nodes, 39 and 40, and rebalance. During rebalance, reboot nodes 42 and 43. Rebalance failed as expected.
* After the nodes finished warmup, rebalance again. Rebalance failed with bug
* Fail over node 44 and rebalance.
* Cluster rebalances saslbucket first. Rebalance was done after 17 hrs
  Started rebalancing bucket saslbucket ns_rebalancer000 ns_1@10.6.2.37 14:44:27 - Mon Oct 1, 2012
  Started rebalancing bucket default ns_rebalancer000 ns_1@10.6.2.37 08:14:08 - Tue Oct 2, 2012
** Rebalance of the default bucket hung around 10:00 AM Tue Oct 2, 2012, as shown in the screen capture

Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD disk.
- 24.8 GB RAM for Couchbase Server at each node
- SSD disk formatted as ext4 on /data
- Each server has its own SSD drive; no disk is shared with another server.
- Create a cluster of 6 nodes with Couchbase Server 2.0.0-1781 installed
- Cluster has 2 buckets, default (12 GB) and saslbucket (12 GB).
- Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)
- Consistent view is enabled on the cluster (by default)
  10.6.2.37
  10.6.2.38
  10.6.2.44
  10.6.2.45
  10.6.2.42
  10.6.2.43

* Load 14 million items to each bucket. Each key has a size from 512 bytes to 1024 bytes
* Mutate 14 million items in each bucket, with the size of each key from 1024 to 1500 bytes
* Load running at about 8K to 10K ops on both buckets
* Query all 4 views from the 2 docs
  10.6.2.39
  10.6.2.40
* Data path /data
* View path /data

Manifest info from build 1781
http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1781-rel.rpm.manifest.xml
Link to collect info of all nodes
https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201210/8nodes-col-info-1781-rebalance-hang-20121002-114333.tgz
Link to tap stats from all nodes
https://friendpaste.com/6JqjtMOwZLvmlx5h9fxt6L |
| Comment by Thuan Nguyen [ 02/Oct/12 ] |
| Promote it to blocker since we hit it often in system test. |
| Comment by Thuan Nguyen [ 02/Oct/12 ] |
| I killed the loads on the default bucket (which was rebalancing but hung). Rebalance started again five minutes after all loads stopped. A few minutes later, I restarted half of the loads on the default bucket, and rebalance continued running. |
| Comment by Thuan Nguyen [ 03/Oct/12 ] |
|
Integrated in github-couchdb-preview #509 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/509/])
Result = SUCCESS
pwansch :
Files :
* src/couch_set_view/src/couch_set_view_group.erl |