Unable to disable autofailover

Hi,

After a recent outage caused by the failover of a single node, we’re looking to temporarily disable the auto-failover option.
However, whichever method I use (curl or couchbase-cli), and on whichever node, I’m unable to disable auto-failover or even amend its timeout.

  • curl:

curl -v -u admin:password http://localhost:8091/settings/autoFailover -d 'enabled=false'

* About to connect() to localhost port 8091 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8091 (#0)
* Server auth using Basic with user 'admin'

> POST /settings/autoFailover HTTP/1.1
> Authorization: Basic abc123
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: localhost:8091
> Accept: */*
> Content-Length: 13
> Content-Type: application/x-www-form-urlencoded

< HTTP/1.1 500 Internal Server Error
< Server: Couchbase Server
< Pragma: no-cache
< Date: Thu, 19 Jan 2017 10:26:20 GMT
< Content-Type: application/json
< Content-Length: 44
< Cache-Control: no-cache
<

* Connection #0 to host localhost left intact
* Closing connection #0
["Unexpected server error, request logged."]
  • couchbase-cli:

ERROR: unable to set auto failover settings (500) Internal Server Error
[u'Unexpected server error, request logged.']

Either method results in the following message in the error log:

[ns_server:error,2017-01-19T10:15:22.675,ns_1@node1.cbcluster.com:<0.31120.483>:menelaus_web:loop:170]Server error during processing: ["web request failed",
{path,"/settings/autoFailover"},
{type,exit},
{what,
{noproc,
{gen_server,call,
[{global,auto_failover},
disable_auto_failover]}}},
{trace,
[{gen_server,call,2,
[{file,"gen_server.erl"},{line,180}]},
{menelaus_web,
handle_settings_auto_failover_post,1,
[{file,"src/menelaus_web.erl"},
{line,1870}]},
{request_throttler,do_request,3,
[{file,"src/request_throttler.erl"},
{line,59}]},
{menelaus_web,loop,2,
[{file,"src/menelaus_web.erl"},
{line,149}]},
{mochiweb_http,headers,5,
[{file,
"/home/buildbot/buildbot_slave/centos-6-x64-301-builder/build/build/couchdb/src/mochiweb/mochiweb_http.erl"},
{line,94}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}]

Any suggestions as to the cause (or even better, a fix) for the above issue? Happy to provide further logs if required.

Cluster
5 server nodes
5 buckets (single replica)
~100 ops/sec on 3 of these buckets
Version: 3.0.1 Community Edition (build-1444)

Plans are underway to upgrade to v4.5.
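
For anyone reading the trace above: the `noproc` exit from `gen_server:call` on `{global,auto_failover}` indicates that no process was registered under that global name when the request arrived. A minimal sketch for checking this from a shell, assuming the `/diag/eval` endpoint is reachable with administrator credentials and prints the evaluated term as plain text. The host, the credentials and the `classify_whereis` helper are placeholders for illustration, not part of Couchbase.

```shell
#!/bin/sh
# Assumptions: host and credentials are placeholders; /diag/eval is assumed
# to return the evaluated Erlang term as plain text.
CB_HOST="${CB_HOST:-localhost:8091}"
CB_USER="${CB_USER:-Administrator}"
CB_PASS="${CB_PASS:-password}"

# Classify the reply to global:whereis_name(auto_failover).
# "undefined" means nothing holds the global name; a pid looks like <0.123.0>.
classify_whereis() {
    case "$1" in
        undefined) echo "not registered" ;;
        '<'*'>')   echo "registered as $1" ;;
        *)         echo "unexpected reply: $1" ;;
    esac
}

# Only contact the cluster when called with --run.
if [ "${1:-}" = "--run" ]; then
    reply=$(curl -s -u "$CB_USER:$CB_PASS" "http://$CB_HOST/diag/eval" \
        -d 'global:whereis_name(auto_failover).')
    classify_whereis "$reply"
fi
```

A "not registered" result would be consistent with the `noproc` error, but it does not by itself say why the process is missing.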

Is anyone able to help with this? We’re still unable to disable autofailover.

I’m not sure what’s going on - seems like the REST API isn’t very happy.

I assume you’ve tried the UI to change this?

Hi drigby,
Thanks for the reply. Yes, I’ve tried the UI on each of the nodes. I receive a standard error:

[ns_server:error,2017-01-26T11:26:08.483,ns_1@nod4.mydomain.com:<0.150.3009>:menelaus_web:loop:170]Server error during processing: ["web request failed",
{path,"/settings/autoFailover"},
{type,exit},
{what,
{noproc,
{gen_server,call,
[{global,auto_failover},
disable_auto_failover]}}},
{trace,
[{gen_server,call,2,
[{file,"gen_server.erl"},{line,180}]},
{menelaus_web,
handle_settings_auto_failover_post,1,
[{file,"src/menelaus_web.erl"},
{line,1870}]},
{request_throttler,do_request,3,
[{file,"src/request_throttler.erl"},
{line,59}]},
{menelaus_web,loop,2,
[{file,"src/menelaus_web.erl"},
{line,149}]},
{mochiweb_http,headers,5,
[{file,
"/home/buildbot/buildbot_slave/centos-6-x64-301-builder/build/build/couchdb/src/mochiweb/mochiweb_http.erl"},
{line,94}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}]

Hmm. Is there anything in the other log files relating to auto_failover? It does sound like something in the cluster manager (aka ns_server) isn’t happy and cannot accept the setting change. Are all your nodes up?
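
A rough way to answer the log question from a shell, assuming the default Linux log directory `/opt/couchbase/var/lib/couchbase/logs` (adjust `LOG_DIR` for other installs). The `auto_failover_refs` helper name is made up for illustration, and rotated files such as `error.log.1` are not covered by the `*.log` glob.

```shell
#!/bin/sh
# Assumption: default Linux log location; override LOG_DIR if yours differs.
LOG_DIR="${LOG_DIR:-/opt/couchbase/var/lib/couchbase/logs}"

# Print "file:count" for every *.log file that mentions auto_failover.
auto_failover_refs() {
    dir="$1"
    # /dev/null forces grep to print file names even when only one log matches
    # the glob; files with zero matching lines are filtered out afterwards.
    grep -c 'auto_failover' /dev/null "$dir"/*.log 2>/dev/null | grep -v ':0$'
}

if [ -d "$LOG_DIR" ]; then
    auto_failover_refs "$LOG_DIR" || echo "no auto_failover references in $LOG_DIR"
fi
```

Running this on each node shows which log files are worth opening; the counts are matching lines, not distinct errors.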

The only references to “auto_failover” are in the error, debug and info log files, all of which relate to the errors above (or to my attempts to reset the count).

All the nodes are up (a rebalance is currently required, but only from today).

jbarton, it’s possible this is related to a bug in Erlang; see https://issues.couchbase.com/browse/MB-7282

The workaround is to run the following:
wget --user=Administrator --password=asdasd --post-data='rpc:call(mb_master:master_node(), erlang, apply, [fun () -> erlang:exit(erlang:whereis(mb_master), kill) end, []]).' http://localhost:8091/diag/eval
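
After running the workaround, the original request can be retried and the setting read back. A minimal sketch, assuming `GET /settings/autoFailover` returns a JSON body containing an `enabled` field. The host, the credentials and the `autofailover_state` helper are placeholders for illustration.

```shell
#!/bin/sh
# Assumptions: host and credentials are placeholders; the GET response is
# assumed to look like {"enabled":false,"timeout":120,"count":0}.
CB_HOST="${CB_HOST:-localhost:8091}"
CB_USER="${CB_USER:-Administrator}"
CB_PASS="${CB_PASS:-password}"

# Read a response body on stdin and print enabled, disabled or unknown.
autofailover_state() {
    body=$(cat)
    case "$body" in
        *'"enabled":true'*|*'"enabled": true'*)   echo enabled ;;
        *'"enabled":false'*|*'"enabled": false'*) echo disabled ;;
        *)                                        echo unknown ;;
    esac
}

# Only contact the cluster when called with --run.
if [ "${1:-}" = "--run" ]; then
    # Retry the request that previously returned the 500 error.
    curl -s -u "$CB_USER:$CB_PASS" "http://$CB_HOST/settings/autoFailover" -d 'enabled=false'
    # Read the setting back and report its state.
    curl -s -u "$CB_USER:$CB_PASS" "http://$CB_HOST/settings/autoFailover" | autofailover_state
fi
```

An "unknown" result covers the case where the server still answers with the error array instead of the settings object.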