[MB-6573] [longevity] rebalance failed due to error "Resetting rebalance status since it's not really running" when there are major page faults on some of the nodes in the cluster Created: 07/Sep/12 Updated: 10/Jan/13 Resolved: 30/Sep/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | ns_server |
| Affects Version/s: | 2.0-beta |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Thuan Nguyen | Assignee: | Karan Kumar |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 2.0-beta-release-notes, system-test | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 6.2 64bit build 2.0.0-1697 | ||
| Description |
|
Cluster information:
- 11 centos 6.2 64bit server with 4 cores CPU - Each server has 10 GB RAM and 150 GB disk. - 8 GB RAM for couchbase server at each node (80% total system memmories) - Disk format ext3 on both data and root - Each server has its own drive, no disk sharing with other server. - Load 7 million items to both buckets - Cluster has 2 buckets, default (3GB) and saslbucket (3GB) - Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11) - Maintain load about 10 K ops, query view on both doc 10.3.121.13 10.3.121.14 10.3.121.15 10.3.121.16 10.3.121.17 10.3.121.20 10.3.121.22 10.3.121.24 10.3.121.25 10.3.121.23 Create cluster with 10 nodes Do swap rebalance. Add node 26 and remove node 25. Before and during rebalance, cluster does not go into swap. Rebalance failed with error "Resetting rebalance status since it's not really running" Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11ndoes-1697-reb-failed-reset-reb-20120907.tgz Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-11nodes-1697-reb-failed-reset-reb-20120907.tgz |
| Comments |
| Comment by Thuan Nguyen [ 07/Sep/12 ] |
|
couchdb:info,2012-09-07T18:46:04.076,ns_1@10.3.121.13:<0.13536.5>:couch_log:info:39]Native compactor output: Compacted /data/saslbucket/51.couch.1 -> /data/saslbucket/51.couch.1.compact
[couchdb:info,2012-09-07T18:46:04.086,ns_1@10.3.121.13:<0.13536.5>:couch_log:info:39]Native initial compact succeeded for "saslbucket/51" [couchdb:info,2012-09-07T18:46:04.095,ns_1@10.3.121.13:<0.8901.0>:couch_log:info:39]CouchDB swapping files /data/saslbucket/51.couch.2 and /data/saslbucket/51.couch.1.compact. [couchdb:info,2012-09-07T18:46:04.104,ns_1@10.3.121.13:<0.8901.0>:couch_log:info:39]Compaction for db "saslbucket/51" completed. [ns_server:info,2012-09-07T18:46:04.105,ns_1@10.3.121.13:<0.13562.5>:compaction_daemon:spawn_vbucket_compactor:644]Compacting <<"saslbucket/52">> [couchdb:info,2012-09-07T18:46:04.106,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]Starting compaction for db "saslbucket/52" [couchdb:info,2012-09-07T18:46:04.522,ns_1@10.3.121.13:<0.7547.5>:couch_log:info:39]10.3.121.15 - - POST /_view_merge/?stale=ok&limit=10 200 [ns_server:debug,2012-09-07T18:46:04.668,ns_1@10.3.121.13:<0.13524.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok [ns_server:debug,2012-09-07T18:46:04.728,ns_1@10.3.121.13:<0.13566.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.29002.1>} [couchdb:info,2012-09-07T18:46:04.855,ns_1@10.3.121.13:<0.5824.5>:couch_log:info:39]10.3.121.25 - - POST /_view_merge/?stale=ok&limit=10 200 [couchdb:info,2012-09-07T18:46:04.868,ns_1@10.3.121.13:<0.13271.5>:couch_log:info:39]Updater checkpointing set view `saslbucket` update for main group `_design/d11` [ns_server:debug,2012-09-07T18:46:04.864,ns_1@10.3.121.13:<0.13566.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok [ns_server:debug,2012-09-07T18:46:04.904,ns_1@10.3.121.13:<0.13581.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.29002.1>} [couchdb:info,2012-09-07T18:46:04.965,ns_1@10.3.121.13:<0.5824.5>:couch_log:info:39]10.3.121.25 - - POST /_view_merge/?stale=ok&limit=10 200 [user:info,2012-09-07T18:46:05.012,ns_1@10.3.121.13:ns_config:ns_janitor:maybe_stop_rebalance_status:139]Resetting rebalance status since it's not really running [ns_server:debug,2012-09-07T18:46:05.014,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs [ns_server:debug,2012-09-07T18:46:05.014,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs [ns_server:debug,2012-09-07T18:46:05.018,ns_1@10.3.121.13:ns_config_rep:ns_config_rep:do_push_keys:317]Replicating some config keys ([rebalance_status,rebalancer_pid]..) [ns_server:debug,2012-09-07T18:46:05.045,ns_1@10.3.121.13:ns_config_log:ns_config_log:log_common:111]config change: rebalancer_pid -> undefined [ns_server:debug,2012-09-07T18:46:05.045,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs [ns_server:debug,2012-09-07T18:46:05.046,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs [ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:ns_config_log:ns_config_log:log_common:111]config change: rebalance_status -> {none,<<"Rebalance stopped by janitor.">>} [ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs [ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs [couchdb:info,2012-09-07T18:46:05.336,ns_1@10.3.121.13:<0.6542.5>:couch_log:info:39]10.3.121.17 - - POST /_view_merge/?stale=ok&limit=10 200 [stats:error,2012-09-07T18:46:05.424,ns_1@10.3.121.13:<0.5179.2>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.3.121.26'] [ns_server:debug,2012-09-07T18:46:05.894,ns_1@10.3.121.13:<0.13614.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.10014.5>} [ns_server:debug,2012-09-07T18:46:05.939,ns_1@10.3.121.13:<0.13581.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok [couchdb:info,2012-09-07T18:46:06.629,ns_1@10.3.121.13:<0.13563.5>:couch_log:info:39]Native compactor output: Compacted /data/saslbucket/52.couch.1 -> /data/saslbucket/52.couch.1.compact [couchdb:info,2012-09-07T18:46:06.665,ns_1@10.3.121.13:<0.13563.5>:couch_log:info:39]Native initial compact succeeded for "saslbucket/52" [couchdb:info,2012-09-07T18:46:06.678,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]CouchDB swapping files /data/saslbucket/52.couch.2 and /data/saslbucket/52.couch.1.compact. [couchdb:info,2012-09-07T18:46:06.981,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]Compaction for db "saslbucket/52" completed. [couchdb:info,2012-09-07T18:46:07.039,ns_1@10.3.121.13:<0.6542.5>:couch_log:info:39]10.3.121.17 - - POST /_view_merge/?stale=ok&limit=10 200 [ns_server:info,2012-09-07T18:46:07.040,ns_1@10.3.121.13:<0.13641.5>:compaction_daemon:spawn_vbucket_compactor:644]Compacting <<"saslbucket/53">> [couchdb:info,2012-09-07T18:46:07.040,ns_1@10.3.121.13:<0.8887.0>:couch_log:info:39]Starting compaction for db "saslbucket/53" [ns_server:info,2012-09-07T18:46:07.435,ns_1@10.3.121.13:<0.13244.5>:ns_orchestrator:handle_info:258]Skipping janitor in state janitor_running: {janitor_state, ["default"], <0.13595.5>} [ns_server:warn,2012-09-07T18:46:09.013,ns_1@10.3.121.13:mb_master:mb_master:send_heartbeat_with_peers:473]send heartbeat timed out |
| Comment by Farshid Ghods [ 10/Sep/12 ] |
|
if i understand correctly nodes are not in swap but there are still time outs in sending heartbeats
we have had this issue before and the bug was marked as won't fix due to the node being in swap. if there are doubts that this is happening because of slowness in some internal calls we should raise that timeout by default. |
| Comment by Aleksey Kondratenko [ 10/Sep/12 ] |
|
This is very interesting thing. Indeed there's no swap use. That however doesn't mean there are no major page faults. My guess is we're either dealing with: * cpu contention in erlang (less likely but we have a way to monitor that) * major page faults * virtual machine host 'robbing' our CPU cycles Unfortunately atop data I have only starts Sep 8 while logs end Sep 7 around 18:40. Need more data. |
| Comment by Karan Kumar [ 10/Sep/12 ] |
|
We definitely can get you more data here.
Can we tell us how to monitor cpu contention in erlang? Karan |
| Comment by Farshid Ghods [ 10/Sep/12 ] |
|
Karan, please work with engineering and Tony to understand what is exaclty required to investigate these issues. Tony has spent days and hours running these tests and still its unknown to us where and why this happens. i would like us to narrow this down to specific calls or interactions between components in the system. |
| Comment by Farshid Ghods [ 10/Sep/12 ] |
|
Thuan,
please revise the spec according the recommendation Aliaksey and Karan will pass to you so that we reduce hitting this condition for now. |
| Comment by Thuan Nguyen [ 10/Sep/12 ] |
|
Link to diags of all nodes on Sept 8th 2012
https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-rebalance-failed-bulk_set_vbucket_state_failed-20120908.tgz |
| Comment by Aleksey Kondratenko [ 10/Sep/12 ] |
|
atop data starts at Sep 8 01:31:58
While last message with memcached slowness indication is at: 696802:[ns_server:error,2012-09-08T1:24:43.330,ns_1@10.3.121.13:<0.6381.0>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 720860 us I'd like you guys to keep atop data for longer periods of time. |
| Comment by Thuan Nguyen [ 10/Sep/12 ] |
|
Here is the link to atop file node 13 (around 1.4 GB in size). https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
This atop file also stored at local vm at /tmp directory All other atop files of other nodes are in /tmp directory All of these atop started on drwx------ 2 root root 4.0K Sep 8 01:21 atop.d |
| Comment by Farshid Ghods [ 19/Sep/12 ] |
|
Karan,
should this be assigned to Aliaksey now ? |
| Comment by Karan Kumar [ 30/Sep/12 ] |
| Unable to repro on 1781 |