[MB-6481] Rebalance failed due to the timeout in bulk_set_vbucket_state operation with continuous view queries Created: 29/Aug/12  Updated: 10/Jan/13  Resolved: 05/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0-beta
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Thuan Nguyen
Resolution: Duplicate Votes: 0
Labels: 2.0-beta-release-notes, system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.2 64bit

Attachments: GZip Archive 11nodes-1645-stats-memcached-close-connection-by-force-20120829.txt.gz     File memcached-log-1663-20120905.tgz    

 Description   
Install Couchbase Server 2.0.0-1645 on 11 nodes:
10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.26
10.3.121.24
10.3.121.25
10.3.121.23
Create a cluster of 10 nodes
Create 2 buckets: default and saslbucket
Create 1 design doc with 2 views for each bucket
Load 7+ million items into the default bucket and 1+ million items into the saslbucket
Maintain a load of about 12K ops on the default bucket and 3K ops on the saslbucket
Continuously query views on both buckets (a rough sketch of this query load follows these steps)
Rebalance nodes in and out
Then do a swap rebalance by adding 3 nodes (24, 25, and 26) to the cluster and removing 3 nodes (20, 22, and 23) from the cluster
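
For reference, a rough sketch of the kind of continuous view-query load used here (not the actual test harness: the view name v1 is made up; the design doc d1 and the stale=ok&limit=10 parameters appear in the logs below, and 8092 is the standard view REST port):

-module(view_query_load).
-export([run/1]).

%% Repeatedly query one view on the default bucket, mimicking the continuous
%% stale=ok&limit=10 view queries that were running during the rebalance.
run(Host) ->
    inets:start(),
    Url = "http://" ++ Host ++
          ":8092/default/_design/d1/_view/v1?stale=ok&limit=10",
    loop(Url, 0).

loop(Url, N) ->
    case httpc:request(get, {Url, []}, [{timeout, 30000}], []) of
        {ok, {{_, 200, _}, _Headers, _Body}} ->
            ok;
        Other ->
            io:format("query ~p failed: ~p~n", [N, Other])
    end,
    loop(Url, N + 1).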
About 3 minutes into the rebalance, it failed with the following error:


2012-08-29 01:03:09.578 ns_orchestrator:4:info:message(ns_1@10.3.121.13) - Starting rebalance, KeepNodes = ['ns_1@10.3.121.13','ns_1@10.3.121.14',
                                 'ns_1@10.3.121.15','ns_1@10.3.121.16',
                                 'ns_1@10.3.121.17','ns_1@10.3.121.24',
                                 'ns_1@10.3.121.25','ns_1@10.3.121.26'], EjectNodes = ['ns_1@10.3.121.20',
                                                                                       'ns_1@10.3.121.22',
                                                                                       'ns_1@10.3.121.23']

2012-08-29 01:03:09.893 ns_storage_conf:0:info:message(ns_1@10.3.121.24) - Deleting old data files of bucket "saslbucket"
2012-08-29 01:03:09.893 ns_storage_conf:0:info:message(ns_1@10.3.121.24) - Deleting old data files of bucket "default"
2012-08-29 01:03:09.896 ns_storage_conf:0:info:message(ns_1@10.3.121.26) - Deleting old data files of bucket "saslbucket"
2012-08-29 01:03:09.898 ns_storage_conf:0:info:message(ns_1@10.3.121.26) - Deleting old data files of bucket "default"
2012-08-29 01:03:09.942 ns_storage_conf:0:info:message(ns_1@10.3.121.25) - Deleting old data files of bucket "saslbucket"
2012-08-29 01:03:09.942 ns_storage_conf:0:info:message(ns_1@10.3.121.25) - Deleting old data files of bucket "default"
2012-08-29 01:03:13.771 ns_rebalancer:0:info:message(ns_1@10.3.121.13) - Started rebalancing bucket saslbucket
2012-08-29 01:03:14.794 ns_memcached:1:info:message(ns_1@10.3.121.26) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.26' in 0 seconds.
2012-08-29 01:03:14.820 ns_memcached:1:info:message(ns_1@10.3.121.24) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.24' in 0 seconds.
2012-08-29 01:03:14.831 ns_memcached:1:info:message(ns_1@10.3.121.25) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.25' in 0 seconds.
2012-08-29 01:03:16.674 ns_vbucket_mover:0:info:message(ns_1@10.3.121.13) - Bucket "saslbucket" rebalance appears to be swap rebalance
2012-08-29 01:06:21.275 ns_orchestrator:2:info:message(ns_1@10.3.121.13) - Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                               [{'ns_1@10.3.121.20',
                                 {'EXIT',
                                  {{timeout,
                                    {gen_server,call,
                                     [timeout_diag_logger,
                                      {diag,
                                       {timeout,
                                        {gen_server,call,[ns_config,get]}}}]}},
                                   {gen_server,call,
                                    [{'janitor_agent-saslbucket',
                                      'ns_1@10.3.121.20'},
                                     {if_rebalance,<0.12846.432>,
                                      {update_vbucket_state,294,replica,
                                       undefined,undefined}},
                                     infinity]}}}}]},
                              [{janitor_agent,bulk_set_vbucket_state,4},
                               {ns_vbucket_mover,
                                update_replication_post_move,3},
                               {ns_vbucket_mover,handle_info,2},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]}



Looking at the log on node 20 around 2012-08-29 01:06:21, just before the rebalance failed, the memcached connection is closed by force:

[ns_server:info,2012-08-29T1:05:49.620,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.121.22'

[ns_server:debug,2012-08-29T1:05:50.500,ns_1@10.3.121.20:'capi_ddoc_replication_srv-default':cb_generic_replication_srv:handle_info:144]doing replicate_newnodes_docs
[ns_server:debug,2012-08-29T1:05:50.656,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:handle_call:113]Got full synchronization request from 'ns_1@10.3.121.13'
[ns_server:debug,2012-08-29T1:05:50.660,ns_1@10.3.121.20:ns_config_log:ns_config_log:log_common:111]config change:
buckets ->
[{configs,[{"default",
            [{map,[]},
             {fastForwardMap,[]},
             {uuid,<<"cdf4999f6920e2d4c3526e17f36c843f">>},
             {sasl_password,[]},
             {num_replicas,2},
             {replica_index,true},
             {ram_quota,3145728000},
             {auth_type,sasl},
             {autocompaction,false},
             {type,membase},
             {num_vbuckets,1024},
             {servers,['ns_1@10.3.121.13','ns_1@10.3.121.14',
                       'ns_1@10.3.121.15','ns_1@10.3.121.16',
                       'ns_1@10.3.121.17','ns_1@10.3.121.20',
                       'ns_1@10.3.121.22','ns_1@10.3.121.23']}]},
           {"saslbucket",
            [{map,[{202,
                    ['ns_1@10.3.121.15','ns_1@10.3.121.22','ns_1@10.3.121.16'],
                    ['ns_1@10.3.121.15','ns_1@10.3.121.25',
                     'ns_1@10.3.121.16']}]},
             {fastForwardMap,[]},
             {uuid,<<"9f1fdb211536936b49c8a392e079ecb0">>},
             {num_replicas,2},
             {replica_index,true},
             {ram_quota,3145728000},
             {auth_type,sasl},
             {sasl_password,"password"},
             {autocompaction,false},
             {type,membase},
             {num_vbuckets,1024},
             {servers,['ns_1@10.3.121.13','ns_1@10.3.121.14',
                       'ns_1@10.3.121.15','ns_1@10.3.121.16',
                       'ns_1@10.3.121.17','ns_1@10.3.121.24',
                       'ns_1@10.3.121.25','ns_1@10.3.121.26',
                       'ns_1@10.3.121.20','ns_1@10.3.121.22',
                       'ns_1@10.3.121.23']}]}]}]
[ns_server:debug,2012-08-29T1:05:50.660,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:handle_call:119]Fully synchronized config in 458 us
[ns_server:info,2012-08-29T1:05:50.857,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:05:50.552220 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.17' - disconnected, keep alive for 300 seconds
memcached<0.506.0>: Wed Aug 29 08:05:50.623620 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.13' - disconnected, keep alive for 300 seconds

[ns_server:debug,2012-08-29T1:05:50.902,ns_1@10.3.121.20:'capi_ddoc_replication_srv-default':cb_generic_replication_srv:handle_info:144]doing replicate_newnodes_docs
[ns_server:error,2012-08-29T1:05:51.136,ns_1@10.3.121.20:<0.16154.0>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 1276102 us
[ns_server:error,2012-08-29T1:05:51.136,ns_1@10.3.121.20:'ns_memcached-saslbucket':ns_memcached:handle_info:594]handle_info(ensure_bucket,..) took too long: 809713 us
[stats:warn,2012-08-29T1:05:51.387,ns_1@10.3.121.20:<0.16160.0>:stats_collector:latest_tick:201]Dropped 1 ticks
[ns_server:info,2012-08-29T1:05:52.779,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:05:52.492646 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.24' - Connection is closed by force.


Link to diags of all nodes
https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/11nodes-1645-memchaced-close-connection-by-force-20120829.tgz


 Comments   
Comment by Thuan Nguyen [ 29/Aug/12 ]
About 4 minutes after this rebalance failed, I ran the 3-node swap rebalance again. It failed after 13 minutes with this error:
Rebalance exited with reason {ns_node_disco_events_exited,
{'EXIT',<0.26298.432>,normal}}
ns_orchestrator002 ns_1@10.3.121.13 01:23:41 - Wed Aug 29, 2012

Got exit from node disco events subscription ns_vbucket_mover000 ns_1@10.3.121.13 01:23:39 - Wed Aug 29, 2012

I looked at the diags on node 13 and saw the same error as in this bug: the memcached connection is closed by force.

** At node 13:
[ns_server:info,2012-08-29T1:23:40.678,ns_1@10.3.121.13:<0.20579.433>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.20': [<<"replication_building_548_'ns_1@10.3.121.24'">>,
                                                       <<"replication_building_548_'ns_1@10.3.121.16'">>,
                                                       <<"replication_building_548_'ns_1@10.3.121.17'">>]
[error_logger:error,2012-08-29T1:23:40.713,ns_1@10.3.121.13:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.20570.433>
    registered_name: []
    exception exit: {exited,
                        {'EXIT',<0.26285.432>,
                            {ns_node_disco_events_exited,
                                {'EXIT',<0.26298.432>,normal}}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.26285.432>,<0.26141.432>]
    messages: [{'EXIT',<0.26285.432>,
                      {ns_node_disco_events_exited,
                          {'EXIT',<0.26298.432>,normal}}}]
    links: [<0.20578.433>,<0.26285.432>]
    dictionary: [{cleanup_list,[<0.20579.433>]}]
    trap_exit: true
    status: running
    heap_size: 1597
    stack_size: 24
    reductions: 3010
  neighbours:

[couchdb:info,2012-08-29T1:23:40.760,ns_1@10.3.121.13:<0.29969.390>:couch_log:info:39]10.3.121.14 - - POST /_view_merge/?stale=ok&limit=10 200
[ns_server:debug,2012-08-29T1:23:40.933,ns_1@10.3.121.13:<0.20580.433>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.15525.374>} event handler: ok
[ns_server:info,2012-08-29T1:23:38.869,ns_1@10.3.121.13:<0.20467.433>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1017:persisted_checkpoint_id">>,<<"65">>},

** At node 20:
[couchdb:info,2012-08-29T1:23:40.724,ns_1@10.3.121.20:<0.13706.405>:couch_log:info:39]Updater reading changes from passive partitions to update replica set view group `_design/d1` from set `default`
[couchdb:info,2012-08-29T1:23:40.734,ns_1@10.3.121.20:<0.31308.390>:couch_log:info:39]10.3.121.14 - - POST /_view_merge/?stale=ok&limit=10 200
[ns_server:info,2012-08-29T1:23:40.737,ns_1@10.3.121.20:<0.16149.0>:mc_connection:do_delete_vbucket:123]Deleting vbucket: saslbucket/547
[couchdb:info,2012-08-29T1:23:40.759,ns_1@10.3.121.20:couch_server:couch_log:info:39]Deleting database saslbucket/547
[couchdb:info,2012-08-29T1:23:40.808,ns_1@10.3.121.20:couch_server:couch_log:info:39]Deleting file /data/saslbucket/547.couch.313
[ns_server:info,2012-08-29T1:23:40.883,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:23:40.677138 3: TAP (Producer) eq_tapq:replication_building_548_'ns_1@10.3.121.24' - Connection is closed by force.

Comment by Karan Kumar (Inactive) [ 05/Sep/12 ]
We looked at the memcached logs.
@Tony is going to attach the memcached logs from node 13; the file is named memcached-log-1663-20120905.tgz and is listed in the attachments above.

On node 13, the rebalance failure appears around this time:
Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                               [{'ns_1@10.3.121.16',
                                 {'EXIT',
                                  {{normal,
                                    {gen_server,call,
                                     [<15201.5655.244>,
                                      {start_vbucket_filter_change,
                                       [171,197,224,225,226,227,228,475,476,
                                        477,478,479,531,532,533,655,656,657]},
                                      30000]}},
                                   {gen_server,call,
                                    [{'janitor_agent-saslbucket',
                                      'ns_1@10.3.121.16'},
                                     {if_rebalance,<0.6019.435>,
                                      {update_vbucket_state,657,replica,
                                       undefined,'ns_1@10.3.121.23'}},
                                     infinity]}}}}]},
                              [{janitor_agent,bulk_set_vbucket_state,4},
                               {ns_vbucket_mover,
                                update_replication_post_move,3},
                               {ns_vbucket_mover,handle_info,2},
                               {gen_server,handle_msg,5},
                               {proc_lib,init_p_do_apply,3}]}

Node 13 memcached logs:-
Wed Sep 5 16:30:35.608213 3: 101 Closing connection due to read error: Connection reset by peer

From memcached.c code:-

        if (res == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                break;
            }
            settings.extensions.logger->log(EXTENSION_LOG_WARNING, c,
                                            "%d Closing connection due to read error: %s", c->sfd, strerror(errno));
            return READ_ERROR;
        }

Comment by Chiyoung Seo [ 05/Sep/12 ]
"connection is closed by force" is the normal log from ep-engine whenever ep-engine receives the connection close request from the ns-server after completing the replication building phase. In other words, it's not an error case.

The rebalance originally failed due to the timeout from 10.3.121.20 in processing {update_vbucket_state,294,replica} on the sasl bucket. We were continuously querying views on both buckets during rebalance.

It would be more useful to grab any paging and swap activity during rebalance, but I think we already have bugs open for very similar timeout issues with continuous view updates and queries.
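
A rough way to grab that from each node while the rebalance runs (assuming the stock vmstat on these CentOS boxes; the output path and the 5s x 12 sampling are arbitrary):

-module(swap_activity).
-export([capture/0]).

%% Collect one minute of vmstat samples (the si/so columns show swap-in/out,
%% and the free/cache columns show memory pressure) and save them so they
%% can be attached to the ticket alongside the diags.
capture() ->
    Out = os:cmd("vmstat 5 12"),
    ok = file:write_file("/tmp/vmstat_during_rebalance.txt", Out),
    ok.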
Comment by Karan Kumar (Inactive) [ 05/Sep/12 ]
Looked at the ns_server code.

-define(SET_VBUCKET_STATE_TIMEOUT, infinity).

The vbucket state timeout is already set to infinity, so it's not clear what more we could raise it to.
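
For what it's worth, here is a minimal standalone sketch of why an infinity timeout on the outer call doesn't help (this is not ns_server code; the module and the 'slow'/'agent' roles are made up for illustration). The callee ('janitor_agent-saslbucket' in the trace above) makes its own inner gen_server:call with the default 5000 ms timeout (to ns_config and timeout_diag_logger in the trace); when that inner call times out, the callee process exits, and the outer infinity call fails with the nested timeout as its exit reason:

-module(nested_timeout_sketch).
-behaviour(gen_server).

-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% 'slow' plays the role of ns_config/timeout_diag_logger: a server too busy
%% (e.g. under heavy disk and view activity) to reply within 5 seconds.
%% 'agent' plays the role of 'janitor_agent-saslbucket': while handling the
%% request it makes an inner gen_server:call with the default 5000 ms timeout.
demo() ->
    {ok, Slow}  = gen_server:start(?MODULE, slow, []),
    {ok, Agent} = gen_server:start(?MODULE, {agent, Slow}, []),
    %% Outer call with infinity, like ?SET_VBUCKET_STATE_TIMEOUT; it still
    %% exits, because Agent dies when its inner 5000 ms call times out.
    catch gen_server:call(Agent, update_vbucket_state, infinity).

init(State) -> {ok, State}.

handle_call(get, _From, slow) ->
    timer:sleep(10000),                       %% never answers in time
    {reply, ok, slow};
handle_call(update_vbucket_state, _From, {agent, Slow} = State) ->
    ok = gen_server:call(Slow, get),          %% default timeout: 5000 ms
    {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Msg, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

demo/0 returns an 'EXIT' whose reason has the shape {{timeout,{gen_server,call,[Pid,get]}},{gen_server,call,[Agent,update_vbucket_state,infinity]}}, which matches the structure of the rebalance failure above. So raising SET_VBUCKET_STATE_TIMEOUT further cannot fix this; the inner default-timeout calls (and whatever is making ns_config slow) fire first.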
Comment by Peter Wansch (Inactive) [ 19/Sep/12 ]
Alk, the timeout was set to infinity and we still received a timeout. Who should look at this next?
Comment by Chiyoung Seo [ 05/Oct/12 ]
http://www.couchbase.com/issues/browse/MB-6595