Unable to reset the auto-failover quota
Hi,
From the automated alert mails, I just got the following alert:
"Could not auto-failover more nodes ('ns_1@xxx.xxx.xxx.xxx'). Maximum number of nodes that will be automatically failovered (1) is reached."
Upon logging into the admin console, I see that all 8 nodes are up and running (showing green), though there are numerous messages about failover not being successful:
Could not auto-failover more nodes ('ns_1@xxx.xxx.xxx.xxx'). Maximum number of nodes that will be automatically failovered (1) is reached.
auto_failover002 ns_1@xxx.xxx.xxx.xxx 14:24:11 - Tue Mar 12, 2013
I also see that the Reset Quota button is enabled. However, when I click on it, an error is thrown stating "Unable to reset the auto-failover quota" and the logs show this entry:
Server error during processing: ["web request failed",
 {path,"/settings/autoFailover/resetCount"},
 {type,exit},
 {what,
  {noproc,
   {gen_server,call,
    [{global,auto_failover},
     reset_auto_failover_count]}}},
 {trace,
  [{gen_server,call,2},
   {menelaus_web,
    handle_settings_auto_failover_reset_count,
    1},
   {menelaus_web,loop,3},
   {mochiweb_http,headers,5},
   {proc_lib,init_p_do_apply,3}]}]
menelaus_web019 ns_1@xxx.xxx.xxx.xxx 14:51:03 - Tue Mar 12, 2013
Any suggestions/ideas what to do? How can I ensure the cluster is healthy?
regards,
-Piyush
I have retried a number of times, but the result is the same.
Under-sizing is (fortunately) not the problem with this cluster. It is an 8-node cluster and barely 20% of its resources are in use as of now.
I have generated the diag report and will upload it along with the bug that I file on your JIRA.
I do have another question. We are planning to upgrade this cluster to 2.0.0. Since I cannot reset the failover quota, how do I go about the whole thing? When I bring a node down, failover won't work, and the subsequent rebalance will cause data loss. Any thoughts?
thanks for looking into the issue.
regards,
-Piyush
created JIRA issue: http://www.couchbase.com/issues/browse/MB-7967
thanks,
-Piyush
We now have more evidence that this may be an issue with Erlang's global service, i.e. MB-7282.
The last error message means the autofailover service crashed recently. It will normally be restarted, so it's safe to retry this operation.
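If retrying by hand gets tedious, the retry can be scripted. Below is a minimal sketch, not an official tool: it repeatedly calls the /settings/autoFailover/resetCount path shown in the log above until the server answers with HTTP 200. The base URL (admin port 8091), the credentials, the use of POST with basic auth, and the attempt count and delay are all assumptions for illustration, so adjust them for your cluster.

```python
import base64
import time
import urllib.error
import urllib.request


def build_reset_request(base_url, user, password):
    """Build the request for the auto-failover counter reset (assumed to be a POST)."""
    url = base_url.rstrip("/") + "/settings/autoFailover/resetCount"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the method is set explicitly below
        headers={"Authorization": "Basic " + token},
        method="POST",
    )


def reset_auto_failover_count(base_url, user, password,
                              attempts=5, delay=10.0,
                              opener=urllib.request.urlopen):
    """Retry the reset until the server accepts it or attempts run out.

    Returns True on the first HTTP 200 response, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        request = build_reset_request(base_url, user, password)
        try:
            with opener(request) as response:
                if response.status == 200:
                    return True
        except urllib.error.URLError as exc:
            # HTTPError (a server error response) is a URLError subclass,
            # so both connection failures and error statuses land here.
            print(f"attempt {attempt} failed: {exc}")
        if attempt < attempts:
            time.sleep(delay)  # give the crashed service time to restart
    return False
```

Hypothetical usage: reset_auto_failover_count("http://127.0.0.1:8091", "Administrator", "password"). If it still returns False after several attempts, the service is probably not coming back on its own and the logs are the next place to look.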
On the other hand, the crash itself (yes, I know it's not good that it's not easily visible in the logs) is a sign that something is unhealthy in your environment. The most common cause of problems like this (though not necessarily in your case) is an under-sized cluster, where OS paging causes Erlang-side timeouts and crashes all over the place.
2.0.1, which is sadly an enterprise-only release for now, improves things a lot in this regard. In particular, on GNU/Linux we mlockall the Erlang VM. AFAIK our policy is to make it available as community edition builds a few weeks after the enterprise-only release. But note that you always have the option of building the product yourself: the 2.0.1 repo manifest and tags are all public.
To dive into this issue yourself, you'll need to look at the server logs, perhaps from multiple nodes. There are two ways to grab these logs (as well as a bunch of other useful diagnostics). One is clicking the 'generate diagnostics report' link at the top of the Log section. The other is running the cbcollect_info tool shipped with the product.
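For the second route, here is a rough sketch of wrapping cbcollect_info in a small script so it can be run the same way on each node. The install path /opt/couchbase/bin/cbcollect_info and the output file name are assumptions (they match a default Linux install as far as I know); the tool is given the name of the zip file to write.

```python
import subprocess


def collect_diagnostics(output_zip,
                        tool="/opt/couchbase/bin/cbcollect_info",
                        runner=subprocess.run):
    """Run cbcollect_info on the local node and return the command used.

    The tool path is an assumed default; pass a different one if needed.
    """
    command = [tool, output_zip]
    # check=True raises CalledProcessError if the tool exits non-zero
    runner(command, check=True)
    return command
```

Hypothetical usage on a node: collect_diagnostics("/tmp/node1-diag.zip"). It typically needs to run as a user that can read the Couchbase log directory.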
Feel free to file a bug with logs attached if you want me to take a look at your case.