[MB-6573] [longevity] rebalance failed due to error "Resetting rebalance status since it's not really running" when there are major page faults on some of the nodes in the cluster Created: 07/Sep/12  Updated: 10/Jan/13  Resolved: 30/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Karan Kumar (Inactive)
Resolution: Fixed Votes: 0
Labels: 2.0-beta-release-notes, system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.2 64-bit, build 2.0.0-1697


 Description   
Cluster information:

- 11 CentOS 6.2 64-bit servers with 4-core CPUs
- Each server has 10 GB RAM and a 150 GB disk.
- 8 GB RAM for Couchbase Server on each node (80% of total system memory)
- Disk format ext3 on both data and root
- Each server has its own drive; no disk sharing with other servers.
- Load 7 million items into both buckets
- Cluster has 2 buckets, default (3 GB) and saslbucket (3 GB)
- Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)
- Maintain a load of about 10K ops and query views on both design docs

10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23

Create a cluster with 10 nodes.
Do a swap rebalance: add node 26 and remove node 25.
Before and during the rebalance, the cluster does not go into swap.

Rebalance failed with error "Resetting rebalance status since it's not really running"

Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11ndoes-1697-reb-failed-reset-reb-20120907.tgz

Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-11nodes-1697-reb-failed-reset-reb-20120907.tgz


 Comments   
Comment by Thuan Nguyen [ 07/Sep/12 ]
[couchdb:info,2012-09-07T18:46:04.076,ns_1@10.3.121.13:<0.13536.5>:couch_log:info:39]Native compactor output: Compacted /data/saslbucket/51.couch.1 -> /data/saslbucket/51.couch.1.compact
[couchdb:info,2012-09-07T18:46:04.086,ns_1@10.3.121.13:<0.13536.5>:couch_log:info:39]Native initial compact succeeded for "saslbucket/51"
[couchdb:info,2012-09-07T18:46:04.095,ns_1@10.3.121.13:<0.8901.0>:couch_log:info:39]CouchDB swapping files /data/saslbucket/51.couch.2 and /data/saslbucket/51.couch.1.compact.
[couchdb:info,2012-09-07T18:46:04.104,ns_1@10.3.121.13:<0.8901.0>:couch_log:info:39]Compaction for db "saslbucket/51" completed.
[ns_server:info,2012-09-07T18:46:04.105,ns_1@10.3.121.13:<0.13562.5>:compaction_daemon:spawn_vbucket_compactor:644]Compacting <<"saslbucket/52">>
[couchdb:info,2012-09-07T18:46:04.106,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]Starting compaction for db "saslbucket/52"
[couchdb:info,2012-09-07T18:46:04.522,ns_1@10.3.121.13:<0.7547.5>:couch_log:info:39]10.3.121.15 - - POST /_view_merge/?stale=ok&limit=10 200
[ns_server:debug,2012-09-07T18:46:04.668,ns_1@10.3.121.13:<0.13524.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok
[ns_server:debug,2012-09-07T18:46:04.728,ns_1@10.3.121.13:<0.13566.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.29002.1>}
[couchdb:info,2012-09-07T18:46:04.855,ns_1@10.3.121.13:<0.5824.5>:couch_log:info:39]10.3.121.25 - - POST /_view_merge/?stale=ok&limit=10 200
[couchdb:info,2012-09-07T18:46:04.868,ns_1@10.3.121.13:<0.13271.5>:couch_log:info:39]Updater checkpointing set view `saslbucket` update for main group `_design/d11`
[ns_server:debug,2012-09-07T18:46:04.864,ns_1@10.3.121.13:<0.13566.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok
[ns_server:debug,2012-09-07T18:46:04.904,ns_1@10.3.121.13:<0.13581.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.29002.1>}
[couchdb:info,2012-09-07T18:46:04.965,ns_1@10.3.121.13:<0.5824.5>:couch_log:info:39]10.3.121.25 - - POST /_view_merge/?stale=ok&limit=10 200

[user:info,2012-09-07T18:46:05.012,ns_1@10.3.121.13:ns_config:ns_janitor:maybe_stop_rebalance_status:139]Resetting rebalance status since it's not really running

[ns_server:debug,2012-09-07T18:46:05.014,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
[ns_server:debug,2012-09-07T18:46:05.014,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
[ns_server:debug,2012-09-07T18:46:05.018,ns_1@10.3.121.13:ns_config_rep:ns_config_rep:do_push_keys:317]Replicating some config keys ([rebalance_status,rebalancer_pid]..)
[ns_server:debug,2012-09-07T18:46:05.045,ns_1@10.3.121.13:ns_config_log:ns_config_log:log_common:111]config change:
rebalancer_pid ->
undefined
[ns_server:debug,2012-09-07T18:46:05.045,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
[ns_server:debug,2012-09-07T18:46:05.046,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
[ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:ns_config_log:ns_config_log:log_common:111]config change:
rebalance_status ->
{none,<<"Rebalance stopped by janitor.">>}
[ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
[ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
[couchdb:info,2012-09-07T18:46:05.336,ns_1@10.3.121.13:<0.6542.5>:couch_log:info:39]10.3.121.17 - - POST /_view_merge/?stale=ok&limit=10 200
[stats:error,2012-09-07T18:46:05.424,ns_1@10.3.121.13:<0.5179.2>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.3.121.26']
[ns_server:debug,2012-09-07T18:46:05.894,ns_1@10.3.121.13:<0.13614.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.10014.5>}
[ns_server:debug,2012-09-07T18:46:05.939,ns_1@10.3.121.13:<0.13581.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok
[couchdb:info,2012-09-07T18:46:06.629,ns_1@10.3.121.13:<0.13563.5>:couch_log:info:39]Native compactor output: Compacted /data/saslbucket/52.couch.1 -> /data/saslbucket/52.couch.1.compact
[couchdb:info,2012-09-07T18:46:06.665,ns_1@10.3.121.13:<0.13563.5>:couch_log:info:39]Native initial compact succeeded for "saslbucket/52"
[couchdb:info,2012-09-07T18:46:06.678,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]CouchDB swapping files /data/saslbucket/52.couch.2 and /data/saslbucket/52.couch.1.compact.
[couchdb:info,2012-09-07T18:46:06.981,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]Compaction for db "saslbucket/52" completed.
[couchdb:info,2012-09-07T18:46:07.039,ns_1@10.3.121.13:<0.6542.5>:couch_log:info:39]10.3.121.17 - - POST /_view_merge/?stale=ok&limit=10 200
[ns_server:info,2012-09-07T18:46:07.040,ns_1@10.3.121.13:<0.13641.5>:compaction_daemon:spawn_vbucket_compactor:644]Compacting <<"saslbucket/53">>
[couchdb:info,2012-09-07T18:46:07.040,ns_1@10.3.121.13:<0.8887.0>:couch_log:info:39]Starting compaction for db "saslbucket/53"
[ns_server:info,2012-09-07T18:46:07.435,ns_1@10.3.121.13:<0.13244.5>:ns_orchestrator:handle_info:258]Skipping janitor in state janitor_running: {janitor_state,
                                            ["default"],
                                            <0.13595.5>}
[ns_server:warn,2012-09-07T18:46:09.013,ns_1@10.3.121.13:mb_master:mb_master:send_heartbeat_with_peers:473]send heartbeat timed out
Comment by Farshid Ghods (Inactive) [ 10/Sep/12 ]
If I understand correctly, the nodes are not in swap but there are still timeouts when sending heartbeats.

We have had this issue before, and that bug was marked as won't-fix because the node was in swap. If there are doubts that this is happening because of slowness in some internal calls, we should raise that timeout by default.
Comment by Aleksey Kondratenko [ 10/Sep/12 ]
This is a very interesting one. Indeed, there's no swap use. That, however, doesn't mean there are no major page faults.

My guess is that we're dealing with one of:

* CPU contention in the Erlang VM (less likely, but we have a way to monitor that; see the run-queue sketch below)

* major page faults (a quick check is sketched at the end of this comment)

* the virtual machine host 'robbing' our CPU cycles

Unfortunately, the atop data I have only starts on Sep 8, while the logs end around 18:40 on Sep 7.

Need more data.
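
A quick check for the page-fault hypothesis, from a shell attached to the node's Erlang VM (a hedged sketch, not an ns_server facility; it assumes the Linux /proc layout where field 12 of /proc/<pid>/stat is the process's cumulative major-fault count):

%% Read the beam process's own /proc stat line and extract majflt
%% (field 12). Splitting naively on spaces works here because the
%% beam.smp process name contains no spaces.
{ok, Stat} = file:read_file("/proc/" ++ os:getpid() ++ "/stat"),
MajFlt = lists:nth(12, string:tokens(binary_to_list(Stat), " ")).

If MajFlt keeps growing between samples while swap usage stays flat, the node is taking major page faults anyway.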
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
We can definitely get you more data here.

Can you tell us how to monitor CPU contention in Erlang?

Karan
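
For reference, one standard way to watch for scheduler contention is to sample the VM's run queue with erlang:statistics(run_queue). A minimal sketch (monitor_run_queue is a hypothetical helper for a scratch module, not something shipped in ns_server):

%% Print the total number of processes that are ready to run but
%% waiting for a scheduler, once a second, N times. A run queue that
%% stays well above zero indicates CPU contention inside the VM.
monitor_run_queue(0) ->
    ok;
monitor_run_queue(N) ->
    io:format("run queue: ~p~n", [erlang:statistics(run_queue)]),
    timer:sleep(1000),
    monitor_run_queue(N - 1).

Calling monitor_run_queue(60). from the node's Erlang shell then gives a minute's worth of one-second samples.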
Comment by Farshid Ghods (Inactive) [ 10/Sep/12 ]
Karan,

please work with engineering and Tony to understand exactly what is required to investigate these issues.

Tony has spent days and hours running these tests, and it is still unknown to us where and why this happens.

I would like us to narrow this down to specific calls or interactions between components in the system.
Comment by Farshid Ghods (Inactive) [ 10/Sep/12 ]
Thuan,

please revise the spec according to the recommendations Aliaksey and Karan will pass to you, so that we reduce the chance of hitting this condition for now.
Comment by Thuan Nguyen [ 10/Sep/12 ]
Link to diags of all nodes on Sept 8th 2012

https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-rebalance-failed-bulk_set_vbucket_state_failed-20120908.tgz
Comment by Aleksey Kondratenko [ 10/Sep/12 ]
The atop data starts at Sep 8 01:31:58, while the last message indicating memcached slowness is at:

 696802:[ns_server:error,2012-09-08T1:24:43.330,ns_1@10.3.121.13:<0.6381.0>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 720860 us

I'd like you guys to keep atop data for longer periods of time.
Comment by Thuan Nguyen [ 10/Sep/12 ]
Here is the link to the atop file for node 13 (around 1.4 GB in size): https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
This atop file is also stored on the local VM in the /tmp directory.

The atop files for all the other nodes are in their /tmp directories.

All of these atop captures started at Sep 8 01:21:
drwx------ 2 root root 4.0K Sep 8 01:21 atop.d

Comment by Farshid Ghods (Inactive) [ 19/Sep/12 ]
Karan,

should this be assigned to Aliaksey now?
Comment by Karan Kumar (Inactive) [ 30/Sep/12 ]
Unable to repro on 1781