Unable to reset the auto-failover quota

Tue, 03/12/2013 - 02:28
p77gin
Joined: 07/26/2012

Hi,

From the automated alert mails, I just got the following alert:

"Could not auto-failover more nodes ('ns_1@xxx.xxx.xxx.xxx'). Maximum number of nodes that will be automatically failovered (1) is reached."

Upon logging into the admin console, I see that all 8 nodes are up and running (showing green), though there are numerous messages about failover not being successful:

Could not auto-failover more nodes ('ns_1@xxx.xxx.xxx.xxx'). Maximum number of nodes that will be automatically failovered (1) is reached.
auto_failover002 ns_1@xxx.xxx.xxx.xxx 14:24:11 - Tue Mar 12, 2013

I also see that the Reset Quota button is enabled. However, when I click on it, an error is thrown stating "Unable to reset the auto-failover quota" and the logs show this entry:

Server error during processing: ["web request failed",
{path,"/settings/autoFailover/resetCount"},
{type,exit},
{what,
{noproc,
{gen_server,call,
[{global,auto_failover},
reset_auto_failover_count]}}},
{trace,
[{gen_server,call,2},
{menelaus_web,
handle_settings_auto_failover_reset_count,
1},
{menelaus_web,loop,3},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]}]
menelaus_web019 ns_1@xxx.xxx.xxx.xxx 14:51:03 - Tue Mar 12, 2013

Any suggestions or ideas on what to do? How can I ensure the cluster is healthy?

regards,
-Piyush

Fri, 03/22/2013 - 11:43
alkondratenko
Joined: 12/01/2010

The last error message means the auto-failover service crashed recently. It will normally be restarted, so it's safe to retry this operation.
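If you'd rather script the retry than keep clicking the button, here is a minimal sketch in Python. It assumes the reset is a POST to the /settings/autoFailover/resetCount path shown in your log, on the default admin port 8091, with HTTP basic auth. The function names, attempt count and delay are made up for illustration and are not part of the product.

```python
import base64
import time
import urllib.error
import urllib.request


def reset_auto_failover_count(host, user, password, port=8091):
    """POST to the reset endpoint from the log; return the HTTP status code."""
    url = "http://%s:%d/settings/autoFailover/resetCount" % (host, port)
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req = urllib.request.Request(url, data=b"", method="POST")
    req.add_header("Authorization", "Basic " + token)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A server-side failure (such as the noproc crash) surfaces here.
        return err.code


def retry_reset(do_reset, attempts=5, delay=2.0, sleep=time.sleep):
    """Call do_reset() until it returns 200; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if do_reset() == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)  # give the service time to be restarted
    return None
```

Usage would look something like `retry_reset(lambda: reset_auto_failover_count("node-address", "Administrator", "password"))`, with your own node address and credentials. If it still returns None after several attempts, the service is probably not coming back on its own and the logs are the next place to look.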

On the other hand, the crash itself (yes, I know it's not good that it's not easily visible in the logs) is a sign of some unhealthiness in your environment. The most common cause of problems like this (though not necessarily the cause of yours) is an under-sized cluster, where OS paging causes Erlang-side timeouts and crashes all over the place.

2.0.1, which is sadly an enterprise-only release for now, improves things a lot in this regard. In particular, on GNU/Linux we now mlockall the Erlang VM. AFAIK our policy is to make it available as community edition builds a few weeks after the enterprise-only release. But note that you always have the option of building the product yourself: the 2.0.1 repo manifest and tags are all public.

To dig into this issue yourself, you'll need to look at the server logs, perhaps from multiple nodes. There are two ways to grab these logs (along with a bunch of other useful diagnostics). One is to click the 'generate diagnostics report' link at the top of the Log section. The other is to run the cbcollect_info tool shipped with the product.
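For the cbcollect_info route, a rough Python wrapper you could run on each node might look like the sketch below. It assumes the default Linux install path /opt/couchbase/bin and that the tool takes the output file name as its only argument; please check both against your installation before relying on it. The helper names are hypothetical.

```python
import subprocess

# Assumed default install location on Linux; adjust for your platform.
CBCOLLECT = "/opt/couchbase/bin/cbcollect_info"


def build_collect_command(output_zip, tool=CBCOLLECT):
    """Return the argument list for one cbcollect_info run."""
    return [tool, output_zip]


def collect_diagnostics(output_zip, tool=CBCOLLECT):
    """Run the tool on the local node; raises CalledProcessError on failure."""
    return subprocess.run(build_collect_command(output_zip, tool), check=True)
```

Giving each node its own output name, e.g. `collect_diagnostics("node1-diag.zip")`, makes it easier to tell the archives apart when attaching them to a bug.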

Feel free to file a bug with the logs attached if you want me to take a look at your case.

Mon, 03/25/2013 - 22:42
p77gin

I have retried a number of times, but the result is the same.

Under-sizing is (fortunately) not the problem with this cluster. It is an 8-node cluster, and barely 20% of its resources are in use as of now.

I have generated the diag report and will upload it along with the bug that I file on your JIRA.

I do have another question. We are planning to upgrade this cluster to 2.0.0. Since I cannot reset the failover quota, how do I go about the whole thing? When I bring a node down, failover won't work, and the subsequent rebalance will cause data loss. Any thoughts?

Thanks for looking into the issue.

regards,
-Piyush

Mon, 03/25/2013 - 22:59
p77gin

Created JIRA issue: http://www.couchbase.com/issues/browse/MB-7967

thanks,
-Piyush

Mon, 04/08/2013 - 18:53
alkondratenko

We now have more evidence that this may be an issue with Erlang's global service, i.e. MB-7282.
