Persistence, replication & data loss

In our environment, there are 4 Couchbase servers in the cluster, with a bucket called "cm-config" configured to have 3 replicas. From time to time, some of the nodes become unreachable, which leads to nodes being automatically failed over. However, when 2 of the 4 nodes are failed over at the same time, there appears to be some data loss. The missing data was set a number of weeks earlier, so I would have expected it to have been persisted and replicated in that time, which should have meant that once 2 nodes were failed over, the replicas on the other servers would become active. Is this correct? Would you expect to see some data loss in the above scenario? If so, is there some way we can lessen the chances of data loss?

Hello,

No, you should not have any data loss in the scenario you are describing. Is this something you can reproduce easily?

Based on your configuration, 4 nodes and 3 replicas (which is more than safe), every node holds a copy of all the data, so the failover should promote the replicas and you should be able to work properly after the failover.
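
To make that reasoning concrete, here is a minimal Python sketch that simulates where the copies of each vBucket live. It is not Couchbase code: `place_vbuckets` and `vbuckets_lost` are hypothetical helpers, the round-robin placement is a simplification of the real vBucket map, and 1024 vBuckets is assumed as the usual default. The node addresses are the ones from the logs in this thread.

```python
from itertools import combinations


def place_vbuckets(nodes, num_replicas, num_vbuckets=1024):
    """Assign each vBucket one active copy plus replicas on distinct nodes.

    Simple round-robin placement for illustration only; the real
    Couchbase vBucket map is computed differently.
    """
    if num_replicas + 1 > len(nodes):
        raise ValueError("need at least num_replicas + 1 nodes")
    placement = {}
    for vb in range(num_vbuckets):
        # First entry is the active copy, the rest are replicas.
        placement[vb] = [
            nodes[(vb + i) % len(nodes)] for i in range(num_replicas + 1)
        ]
    return placement


def vbuckets_lost(placement, failed_nodes):
    """Return the vBuckets whose every copy lived on a failed node."""
    failed = set(failed_nodes)
    return [vb for vb, holders in placement.items() if set(holders) <= failed]


nodes = ["10.110.32.161", "10.110.32.162", "10.110.32.163", "10.110.32.164"]
placement = place_vbuckets(nodes, num_replicas=3)

# Try every possible pair of simultaneously failed nodes.
worst_case = max(
    len(vbuckets_lost(placement, pair)) for pair in combinations(nodes, 2)
)
print(worst_case)  # 0: every vBucket still has a copy on a surviving node
```

With 3 replicas on 4 nodes, every vBucket has a copy on every node, so no pair of failures can remove all copies. Calling `place_vbuckets(nodes, num_replicas=1)` instead shows that some pairs of failed nodes would then take out both copies of a vBucket.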

While I am doing more tests, do you know why some servers are unreachable?

It happens sporadically in our production environment, but I have yet to be able to reproduce it in our test environment.

As for why the servers are unreachable, I'm not sure what happened. The machines themselves were still accessible (based on the uptime of the machines). I can see the following in the Couchbase logs multiple times:

...
> Could not auto-failover more nodes ('ns_1@10.110.32.164'). Maximum number of nodes that will be automatically failovered (1) is reached. - auto_failover002 - ns_1@10.110.32.162 - 02:55:24 - Tue Jul 30, 2013
> Could not auto-failover more nodes ('ns_1@10.110.32.164'). Maximum number of nodes that will be automatically failovered (1) is reached. (repeated 71 times) - auto_failover002 - ns_1@10.110.32.162 - 02:55:24 - Tue Jul 30, 2013
> Could not auto-failover more nodes ('ns_1@10.110.32.164'). Maximum number of nodes that will be automatically failovered (1) is reached. auto_failover002 - ns_1@10.110.32.162 - 02:49:24 - Tue Jul 30, 2013
> Could not auto-failover more nodes ('ns_1@10.110.32.164'). Maximum number of nodes that will be automatically failovered (1) is reached. (repeated 68 times) - auto_failover002 - ns_1@10.110.32.162 - 02:49:24 - Tue Jul 30, 2013
> Node 'ns_1@10.110.32.162' saw that node 'ns_1@10.110.32.164' went down. - ns_node_disco005 - ns_1@10.110.32.162 - 02:43:55 - Tue Jul 30, 2013
> Node 'ns_1@10.110.32.161' saw that node 'ns_1@10.110.32.164' went down. - ns_node_disco005 - ns_1@10.110.32.161 - 02:43:54 - Tue Jul 30, 2013
> Node 'ns_1@10.110.32.163' saw that node 'ns_1@10.110.32.164' went down. - ns_node_disco005 - ns_1@10.110.32.163 - 02:43:54 - Tue Jul 30, 2013
> Could not auto-failover more nodes ('ns_1@10.110.32.164'). Maximum number of nodes that will be automatically failovered (1) is reached.
...
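
One way to read an excerpt like this is to group the entries by the node they concern. The following Python sketch does that for a few lines copied from the excerpt above. The regular expressions and the `summarize` helper are assumptions based only on the format visible here, not on any documented Couchbase log format.

```python
import re
from collections import defaultdict

# Matches the "went down" entries as they appear in the excerpt above.
WENT_DOWN = re.compile(
    r"Node '(?P<observer>[^']+)' saw that node '(?P<target>[^']+)' went down\."
    r".* - (?P<time>\d{2}:\d{2}:\d{2}) - "
)
# Matches the refused auto-failover entries.
REFUSED = re.compile(r"Could not auto-failover more nodes \('(?P<target>[^']+)'\)")


def summarize(lines):
    """Group observers by the node they reported down and count refusals.

    Refusals are counted per log entry; "(repeated N times)" suffixes
    are not expanded.
    """
    down = defaultdict(list)
    refused = defaultdict(int)
    for line in lines:
        m = WENT_DOWN.search(line)
        if m:
            down[m["target"]].append((m["time"], m["observer"]))
            continue
        m = REFUSED.search(line)
        if m:
            refused[m["target"]] += 1
    return dict(down), dict(refused)


# Lines copied from the excerpt in this thread.
sample = [
    "> Could not auto-failover more nodes ('ns_1@10.110.32.164'). Maximum number of nodes that will be automatically failovered (1) is reached. - auto_failover002 - ns_1@10.110.32.162 - 02:55:24 - Tue Jul 30, 2013",
    "> Node 'ns_1@10.110.32.162' saw that node 'ns_1@10.110.32.164' went down. - ns_node_disco005 - ns_1@10.110.32.162 - 02:43:55 - Tue Jul 30, 2013",
    "> Node 'ns_1@10.110.32.161' saw that node 'ns_1@10.110.32.164' went down. - ns_node_disco005 - ns_1@10.110.32.161 - 02:43:54 - Tue Jul 30, 2013",
    "> Node 'ns_1@10.110.32.163' saw that node 'ns_1@10.110.32.164' went down. - ns_node_disco005 - ns_1@10.110.32.163 - 02:43:54 - Tue Jul 30, 2013",
]

down, refused = summarize(sample)
for target, reports in down.items():
    print(target, sorted(reports))
print(refused)
```

On these lines, all three other nodes report 10.110.32.164 as down within about a second of each other, and the refused auto-failover concerns that same node.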

I guess it happened when you had only 2 machines. It is not possible to do an auto-failover with fewer than 3 nodes.

Where are your servers deployed? Do you have a good, reliable network between the different nodes?

Yes, the network should be reliable, but I will check this.

However, why is it not possible to do auto-failover with fewer than 3 nodes?

1 Answer

Hello,

To continue from the comment discussion.

You cannot enable auto-failover on a 2-node cluster (and auto-failover will not happen once your cluster is reduced to 2 nodes) because you cannot move to a 1-node cluster: if you end up with a single-node cluster, you do not have any replicas, so you may lose data.

This is just a safety net that forces deployments to always have 2 nodes or more.
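
As a rough illustration, the two conditions mentioned in this thread (a minimum cluster size, and the maximum auto-failover count seen in the log messages) can be modeled in Python as below. The function name and its parameters are hypothetical, and this is only a sketch of the rule as described here, not Couchbase's actual implementation.

```python
def can_auto_failover(cluster_size, already_auto_failed_over,
                      max_count=1, min_cluster_size=3):
    """Return (allowed, reason) for auto-failing over one more node.

    cluster_size counts the nodes currently in the cluster, including
    the one that has just become unreachable.
    """
    if cluster_size < min_cluster_size:
        return False, "cluster has fewer than %d nodes" % min_cluster_size
    if already_auto_failed_over >= max_count:
        return False, "maximum number of auto-failovers reached"
    return True, "ok"


# 4-node cluster, first node becomes unreachable: allowed.
print(can_auto_failover(cluster_size=4, already_auto_failed_over=0))
# 3 nodes left, one auto-failover already used: refused, as in the logs.
print(can_auto_failover(cluster_size=3, already_auto_failed_over=1))
# Only 2 nodes left: refused regardless of the counter.
print(can_auto_failover(cluster_size=2, already_auto_failed_over=0))
```

In this model, the second case corresponds to the "Maximum number of nodes that will be automatically failovered (1) is reached" messages quoted in the question's comments.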

Regards
Tug
@tgrall