Couchbase-server failover removes node from cluster

When running a Couchbase cluster of multiple nodes (2 in this case), we notice that as of CB 7 a node is removed from the cluster upon performing a failover and cannot be re-added. This is different from the behavior in CB 4.5 through 6.6.
This is seen on both CentOS 7 and Rocky Linux 8.

Scenario:

  • 2 nodes running CB 7.1.3 EE (just the data service, one or more Couchbase buckets)
  • 1 node is stopped
  • a hard failover is performed, either via couchbase-cli, the API, or the UI (for example, with the command sketched after this list)

Result:

  • the failed node is removed from the cluster
  • the failed node cannot be re-added to the cluster

Expected result:

  • the failed node is marked as ‘unhealthy/inactiveFailed’
  • once the node is started, it can be re-added to the cluster and data can be rebalanced.
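For reference, the hard failover was issued roughly like this via couchbase-cli (host and credentials are placeholders here, and the exact flags may differ slightly between versions):

/opt/couchbase/bin/couchbase-cli failover -c 127.0.0.1 -u Administrator -p <password> --server-failover <failed-node>:8091 --hard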

Is this an intended change in behavior of the ‘failover’ functionality, or am I overlooking something?

Hi @penacho - I’d say this isn’t expected but we’d likely need more information and logs to troubleshoot why it’s happening.

Can you provide the error message you get when trying to re-add it? If you have a support contract with us, it would be good to open a ticket so they can analyse the logs.

Hi @perry,

Glad to hear that this isn’t expected.
I’m a bit puzzled why failover/recovery no longer seems to work the way it used to. I have to admit that we stuck with version 4.6 for a long time and only recently moved through 5.1 and 6.6 to 7.1, so I have probably missed a change in the failover/recovery concept in the more recent versions.

This is a simple scenario I used to test this:

(I’m not allowed to put the whole scenario in this post: 403 Forbidden ?!)

Starting with an up-and-running 2-node cluster of CB 7.1.3 EE on CentOS 7:

[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli server-list -c 127.0.0.1 -u Administrator -p admin123
ns_1@192.168.99.151 192.168.99.151:8091 healthy active
ns_1@cb7-a.infra.somewhere.com cb7-a.infra.somewhere.com:8091 healthy active

[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli bucket-list -c 127.0.0.1 -u Administrator -p admin123
conv_session_info
bucketType: membase
numReplicas: 1
ramQuota: 536870912
ramUsed: 331122208

After simulating a failed node/server by shutting it down, its state changes to unhealthy, as expected:
(no auto-failover and such in place for this test)

[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli server-list -c 127.0.0.1 -u Administrator -p admin123
ns_1@192.168.99.151 192.168.99.151:8091 unhealthy active
ns_1@cb7-a.infra.somewhere.com cb7-a.infra.somewhere.com:8091 healthy active
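(For completeness: that auto-failover is indeed off can be checked via the REST API with something like the call below; the response is JSON and should report "enabled": false in this setup.)

curl -u Administrator:admin123 http://127.0.0.1:8091/settings/autoFailover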

Now I want to force a failover, so that the replica items on the remaining server are activated:
(trimming the curl command a bit, otherwise posting is not allowed)

curl /controller/failOver -d 'otpNode=ns_1@192.168.99.151'
HTTP/1.1 504 Gateway Time-out

Bummer, it fails (as the node can’t be reached), so try harder:

curl /controller/failOver -d 'otpNode=ns_1@192.168.99.151' -d allowUnsafe=true
HTTP/1.1 200 OK

That worked, but now the 2nd node is gone from the cluster :astonished:

[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli server-list -c 127.0.0.1 -u Administrator -p admin123
ns_1@cb7-a.infra.somewhere.com cb7-a.infra.somewhere.com:8091 healthy active

whereas with CB up to 6.6 it would remain in the cluster as ‘unhealthy inactiveFailed’.

With the node removed from the cluster, it can’t be added back after it has been started again:

[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli recovery -c 127.0.0.1 -u Administrator -p admin123 --server-recovery 192.168.99.151
ERROR: Server not found 192.168.99.151:8091
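For reference, the flow I expected to be able to use once the failed node is up again (and which worked up to 6.6) is roughly the following; the flags are from memory, so they may not be exact:

/opt/couchbase/bin/couchbase-cli recovery -c 127.0.0.1 -u Administrator -p admin123 --server-recovery 192.168.99.151:8091 --recovery-type delta
/opt/couchbase/bin/couchbase-cli rebalance -c 127.0.0.1 -u Administrator -p admin123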

Hi @penacho,

It looks like you’re attempting to do an unsafe (quorum) failover here:

curl /controller/failOver -d 'otpNode=ns_1@192.168.99.151' -d allowUnsafe=true
HTTP/1.1 200 OK

This type of failover is not a typical failover, and did not exist in older versions of Couchbase Server. It behaves slightly differently to a typical failover as it is designed to allow fewer than half of the nodes in a cluster to continue operation (with as much data as is available) after the majority of nodes in the cluster experience an issue. You can read more about it here.

In particular, I’d like to draw your attention to this section of the documentation on the consequences of unsafe failover. These consequences, I believe, are the issues that you are experiencing:

The nodes that have been failed over are also immediately removed from the cluster. They are, however, not informed of their removal; and so may continue to attempt to behave as if members of the cluster.

The failed over nodes cannot be recovered; and will therefore need to be re-initialized, if they are to be re-introduced into the cluster.
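In practice, re-introducing such a node means resetting it and then adding it back to the cluster as a new node. A rough sketch of that process on a default Linux install (the paths and flags here are illustrative only, so please verify them against your own installation before deleting anything):

systemctl stop couchbase-server
rm -rf /opt/couchbase/var/lib/couchbase/config /opt/couchbase/var/lib/couchbase/data
systemctl start couchbase-server
/opt/couchbase/bin/couchbase-cli server-add -c 127.0.0.1 -u Administrator -p admin123 --server-add 192.168.99.151:8091 --server-add-username Administrator --server-add-password admin123 --services data
/opt/couchbase/bin/couchbase-cli rebalance -c 127.0.0.1 -u Administrator -p admin123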

Interesting… thanks for pointing that out.

I’ve been reading mostly here to look for changes from previous releases, and I find that the information is quite fragmented and scattered around, making it a bit hard to tie it all together.

Then my remaining issue is:

  • considering a two-node cluster
  • one node fails for whatever reason (it is expected to become available again after some time)
  • we want to fail over that node to activate the replica entries on the remaining node
  • AFAIK this calls for a hard failover, but that fails:
[rgr@cb7-a ~]# curl <...>  http://127.0.0.1:8091/controller/failOver -d 'otpNode=ns_1@192.168.99.151'
< HTTP/1.1 504 Gateway Time-out

Sometimes the response is:

Cannot safely perform a failover at the moment

Hi @BenHuddleston,

If I understand this correctly, this implies that on a two-node cluster (yes, we know: not recommended/ideal, but we do use this) it is no longer possible to fail over a node with the intention of adding it back later once it is restored? This is a significant change compared to previous versions of CB.

Can this ‘quorum failure’ be overruled?
I looked for a setting, but haven’t found one yet.

Interestingly, Perform Hard Failover | Couchbase Docs uses screenshots with just two nodes in the cluster…

With a two-node cluster, when one node is unreachable and you attempt to fail it over, this is the expected behaviour.

[rgr@cb7-a ~]# curl <…> http://127.0.0.1:8091/controller/failOver -d 'otpNode=ns_1@192.168.99.151'
HTTP/1.1 504 Gateway Time-out

Sometimes the response is:

Cannot safely perform a failover at the moment

As a majority (more than half) of the nodes in a two-node cluster is two, it is not possible to fail over one of the nodes in a safe and consistent manner, due to the scenario described here. Allowing such a failover could lead to a potentially inconsistent state and inconsistent data being served.

If I understand this correctly, this implies that on a two-node cluster (yes, we know: not recommended/ideal, but we do use this) it is no longer possible to fail over a node with the intention of adding it back later once it is restored?

It depends on why the node is being failed over. If the cluster management service on both nodes is alive and running, but the data service is not, then failing over one of the nodes will be possible. If, however, there is a network partition and the two nodes cannot communicate, then it is not possible to fail over either node, as a majority consensus (quorum) cannot be reached.
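For example, if only the data service on 192.168.99.151 had stopped while the cluster manager kept running, a regular hard failover along the following lines should succeed, and the node could later be recovered and rebalanced back in (flags illustrative):

/opt/couchbase/bin/couchbase-cli failover -c 127.0.0.1 -u Administrator -p admin123 --server-failover 192.168.99.151:8091 --hard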

This is a significant change compared to previous versions of CB.

Indeed, but it improves metadata/data safety and consistency, and as you pointed out earlier, we recommend against two node clusters being deployed for reasons such as this.

Can this ‘quorum failure’ be overruled?

The only override for this case is the allowUnsafe option of the failover API which you have previously used. Unfortunately this does have the side effect of not allowing recovery of the node, again, for the sake of metadata/data safety and consistency.
