Persistence, replication & data loss

seans · July 31, 2013, 10:50am

In our environment, there are 4 Couchbase servers in the cluster, with a bucket called “cm-config” configured to contain 3 replicas. From time to time, some of the nodes become unreachable, which leads to nodes being automatically failed over. However, when 2 of the 4 nodes are failed over at the same time, there appears to be some data loss. The data that was missing was set quite a number of weeks previously, so I would have expected that it would have been correctly persisted and replicated in that time, which should have meant that once 2 nodes were failed over, the replicas on the other servers would become active. Is this correct? Would you expect to see some data loss in the above scenario? If so, is there some way we can lessen the chances of data loss?

tgrall · August 1, 2013, 2:00pm

Hello,

No, you should not have any data loss in the scenario you are describing. Is this something you can reproduce easily?

Based on your configuration of 4 nodes and 3 replicas (which is more than safe), every node holds a copy of all the data, so the failover should promote the replicas and you should be able to keep working properly after the failover.
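The reasoning about 4 nodes and 3 replicas can be checked with a small simulation. This is only a sketch: the round-robin placement, the node names and the function names are assumptions made for illustration, not how Couchbase actually builds its vBucket map. It also models placement only, not replication lag or whether a failover was actually triggered.

```python
from itertools import combinations

# Hypothetical node names and an illustrative vBucket count.
NODES = ["node1", "node2", "node3", "node4"]
NUM_VBUCKETS = 1024


def place_copies(nodes, num_replicas, num_vbuckets=NUM_VBUCKETS):
    """Map each vBucket to the nodes holding its active copy and replicas.

    Simplified round-robin placement (an assumption): the active copy goes
    on one node and each replica on the next nodes in the list.
    """
    copies = min(num_replicas + 1, len(nodes))
    return {
        vb: [nodes[(vb + i) % len(nodes)] for i in range(copies)]
        for vb in range(num_vbuckets)
    }


def lost_vbuckets(layout, failed_nodes):
    """Return the vBuckets whose every copy lived on a failed node."""
    failed = set(failed_nodes)
    return [vb for vb, holders in layout.items() if failed.issuperset(holders)]


def worst_case_loss(nodes, num_replicas, num_failed):
    """Largest number of vBuckets lost over every choice of failed nodes."""
    layout = place_copies(nodes, num_replicas)
    return max(
        len(lost_vbuckets(layout, failed))
        for failed in combinations(nodes, num_failed)
    )


if __name__ == "__main__":
    # 3 replicas on 4 nodes: every node holds a copy of every vBucket.
    print(worst_case_loss(NODES, num_replicas=3, num_failed=2))
    # 1 replica on 4 nodes: losing two adjacent nodes loses some vBuckets.
    print(worst_case_loss(NODES, num_replicas=1, num_failed=2))
```

Under these assumptions, a 3-replica bucket keeps at least one copy of every vBucket as long as one node survives, whereas a 1-replica bucket can lose vBuckets when two nodes go down together.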

While I run more tests, do you know why some servers are unreachable?

seans · August 1, 2013, 2:32pm

It happens sporadically in our production environment, but I have yet to reproduce it in our test environment.

As for why the servers are unreachable, I’m not sure what happened. The machines themselves were still accessible (based on the uptime of the machines). I can see the following in the Couchbase logs multiple times:

…

Could not auto-failover more nodes (‘ns_1@10.110.32.164’). Maximum number of nodes that will be automatically failovered (1) is reached. - auto_failover002 - ns_1@10.110.32.162 - 02:55:24 - Tue Jul 30, 2013
Could not auto-failover more nodes (‘ns_1@10.110.32.164’). Maximum number of nodes that will be automatically failovered (1) is reached. (repeated 71 times) - auto_failover002 - ns_1@10.110.32.162 - 02:55:24 - Tue Jul 30, 2013
Could not auto-failover more nodes (‘ns_1@10.110.32.164’). Maximum number of nodes that will be automatically failovered (1) is reached. auto_failover002 - ns_1@10.110.32.162 - 02:49:24 - Tue Jul 30, 2013
Could not auto-failover more nodes (‘ns_1@10.110.32.164’). Maximum number of nodes that will be automatically failovered (1) is reached. (repeated 68 times) - auto_failover002 - ns_1@10.110.32.162 - 02:49:24 - Tue Jul 30, 2013
Node ‘ns_1@10.110.32.162’ saw that node ‘ns_1@10.110.32.164’ went down. - ns_node_disco005 - ns_1@10.110.32.162 - 02:43:55 - Tue Jul 30, 2013
Node ‘ns_1@10.110.32.161’ saw that node ‘ns_1@10.110.32.164’ went down. - ns_node_disco005 - ns_1@10.110.32.161 - 02:43:54 - Tue Jul 30, 2013
Node ‘ns_1@10.110.32.163’ saw that node ‘ns_1@10.110.32.164’ went down. - ns_node_disco005 - ns_1@10.110.32.163 - 02:43:54 - Tue Jul 30, 2013
Could not auto-failover more nodes (‘ns_1@10.110.32.164’). Maximum number of nodes that will be automatically failovered (1) is reached.
…
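The repeated message above indicates that the auto-failover counter had reached its limit of 1, so further nodes were not failed over automatically. Couchbase Server exposes a REST endpoint, `/settings/autoFailover/resetCount`, for resetting that counter. The sketch below only builds the request; the helper name, the credentials and the default admin port 8091 are assumptions for illustration and should be checked against the documentation for the server version in use.

```python
import base64
import urllib.request


def build_reset_count_request(host, user, password, port=8091):
    """Build (but do not send) a POST that resets the auto-failover counter.

    The helper name and the credentials passed in are placeholders;
    8091 is assumed to be the cluster's admin port.
    """
    url = f"http://{host}:{port}/settings/autoFailover/resetCount"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, data=b"", method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder credentials; the address is one of the nodes in the log.
    req = build_reset_count_request("10.110.32.162", "Administrator", "password")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Resetting the counter only makes sense once the cause of the first failover is understood and the cluster has been rebalanced.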

tgrall · August 2, 2013, 8:08am

I guess it happened when you had only 2 machines left. It is not possible to do an auto-failover with fewer than 3 nodes.

Where are your servers deployed? Do you have a good, reliable network between the different nodes?

seans · August 6, 2013, 8:54am

Yes, the network should be reliable; however, I will check this.

However, why is it not possible to do auto-failover with fewer than 3 nodes?

tgrall · October 2, 2013, 4:30pm

Hello,

To continue from the comment discussion.

You cannot enable auto-failover on a 2-node cluster (and auto-failover will not happen when your cluster is reduced to 2 nodes) because you cannot move to a 1-node cluster: if you end up with a single-node cluster you do not have any replicas, so you may lose data.

This is just a safety net that forces deployments to always have 2 nodes or more.
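The two rules discussed in this thread (no auto-failover with fewer than 3 nodes, and the limit of 1 automatic failover shown in the log) can be sketched as a small check. The function and parameter names are hypothetical, and this is a simplified model rather than Couchbase's actual implementation.

```python
def can_auto_failover(cluster_size, already_failed_over,
                      max_count=1, min_cluster_size=3):
    """Decide whether one more node may be failed over automatically.

    cluster_size: nodes in the cluster before this failover.
    already_failed_over: automatic failovers since the counter was reset.
    max_count=1 mirrors the "(1)" in the log message; min_cluster_size=3
    mirrors the "fewer than 3 nodes" rule. Both defaults are assumptions
    drawn from this thread.
    """
    if cluster_size < min_cluster_size:
        return False, "cluster has fewer than 3 nodes"
    if already_failed_over >= max_count:
        return False, "maximum number of automatic failovers reached"
    return True, "ok"


if __name__ == "__main__":
    print(can_auto_failover(4, 0))  # first failure in a 4-node cluster
    print(can_auto_failover(3, 1))  # second failure, counter not reset
    print(can_auto_failover(2, 0))  # cluster already down to 2 nodes
```

In this model, the second node that goes down in a 4-node cluster is not failed over automatically until the counter is reset, which matches the log messages quoted earlier.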

Regards
Tug
@tgrall
