.NET Client Behavior During Node Failure

craigkoster · July 10, 2015, 2:19pm

Hi All,

We are currently evaluating Couchbase and are doing some testing with the .NET SDK (we are a Microsoft shop). We have a three-node test cluster and have created a small test program that spins up 50 threads that do 1000 upserts each. Couchbase and the .NET SDK handle this very well during normal operations, as well as when we do a manual graceful failover of one node: no errors or data loss occur, and we see the expected results at the end of the run. However, we are now testing the scenario where one of the three nodes crashes (we are simulating this by simply turning off the CouchbaseServer Windows service), and the .NET SDK doesn’t seem to be behaving as we expected.

It may just be a misunderstanding on our part, but from looking at the SDK client configuration and taking a peek at the code, we were under the impression that if a node goes down, the client will gracefully switch to another available node to perform operations. What we are seeing is that the client just gives the message “The node that the key was mapped to is either down or unreachable. The SDK will continue to try to connect every 1000 ms.” and keeps failing over and over. I pulled the SDK code from GitHub and stepped through it, and I do see that Server._isDown is correctly set to true for the failed node, but I don’t see anywhere in the code where this condition would cause the operation to change affinity from the downed server to a live one.

Are we misunderstanding how the client is supposed to behave in this scenario? Here is our client config for reference:

var clientConfiguration = new ClientConfiguration
{
    Servers = new List<Uri>
    {
        new Uri("http://server1.XXXX.com:8091/pools"),
        new Uri("http://server2.XXXX.com:8091/pools"),
        new Uri("http://server3.XXXX.com:8091/pools")
    }
};

var cluster = new Cluster(clientConfiguration);
var bucket = cluster.OpenBucket("load-testing");

Any help or direction you could provide would be much appreciated.

Regards,

Craig

envitraux · July 10, 2015, 2:52pm

I had a similar problem. You need to use the ClusterHelper to manage connections. Starting a new connection takes time, and the helper will keep connections alive.
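A minimal sketch of the ClusterHelper approach, assuming the 2.x .NET SDK. The key and value are made up for illustration, and the server URI and bucket name are copied from the config earlier in the thread. This only shows connection reuse; it is not expected to change which node a key maps to.

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

class Program
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri>
            {
                new Uri("http://server1.XXXX.com:8091/pools")
            }
        };

        // Initialize once at application start; ClusterHelper holds a shared Cluster.
        ClusterHelper.Initialize(config);

        // Repeated GetBucket calls reuse the already opened bucket
        // instead of opening new connections each time.
        var bucket = ClusterHelper.GetBucket("load-testing");

        // "example-key" / "example-value" are placeholders.
        var result = bucket.Upsert("example-key", "example-value");
        Console.WriteLine(result.Status);

        // Close once at application shutdown.
        ClusterHelper.Close();
    }
}
```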

Keith

jmorris · July 13, 2015, 5:17pm

It depends upon the operation and, more importantly, on whether you are using replica reads. Since keys are mapped directly to the node that they exist on, any mutation operation will fail for keys mapped to the down node. In this case, NodeUnavailable will be returned along with the message “The node that the key was mapped to is either down or unreachable.” If the operation is a Get and you have replicas enabled, you can follow up with a replica read if NodeUnavailable is encountered.
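As a rough sketch of that follow-up read, assuming the 2.x .NET SDK and a bucket with at least one replica configured: the helper name GetWithReplicaFallback is made up for illustration, and a replica may lag slightly behind the active copy.

```csharp
using Couchbase;
using Couchbase.Core;
using Couchbase.IO;

static class ReplicaReadSketch
{
    // Hypothetical helper: try the active node first, then fall back
    // to a replica only when the active node is reported as unavailable.
    public static IOperationResult<T> GetWithReplicaFallback<T>(IBucket bucket, string key)
    {
        var result = bucket.Get<T>(key);
        if (result.Status == ResponseStatus.NodeUnavailable)
        {
            // The key's active node is down; read from a replica instead.
            return bucket.GetFromReplica<T>(key);
        }
        return result;
    }
}
```

Usage would look like `var doc = ReplicaReadSketch.GetWithReplicaFallback<string>(bucket, "some-key");`, with the caller still checking `doc.Success` afterwards.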

When the _isDown flag is set for a node, a timer will fire every 1000 ms (configurable via ClientConfiguration.NodeAvailableCheckInterval) and a NOOP will be attempted on that node on a separate thread. Once the NOOP completes successfully, the _isDown flag will be set to false and the node will come back online.
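For reference, a sketch of changing that interval in code, assuming the 2.x .NET SDK. The value of 500 ms is arbitrary and only for illustration; the server URI is copied from the config earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using Couchbase.Configuration.Client;

static class CheckIntervalSketch
{
    public static ClientConfiguration Build()
    {
        return new ClientConfiguration
        {
            Servers = new List<Uri>
            {
                new Uri("http://server1.XXXX.com:8091/pools")
            },
            // Milliseconds between NOOP checks against a node marked as down.
            // 500 is an illustrative value, not a recommendation.
            NodeAvailableCheckInterval = 500
        };
    }
}
```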

-Jeff

itay · July 23, 2015, 5:09am

My understanding is as @jmorris explained. That is, if a node is down, a MANUAL check of the error message should trigger an additional call to the cluster with the Replica flag on. This should be done on every GET, regardless of whether the node is up or not.

Can an UPDATE also be called with a Replica flag?

Itay

jmorris · July 23, 2015, 4:39pm

No, since an update will be to the master first; eventually the delta will be replicated out to any nodes configured to support replicas.

-Jeff

itay · July 23, 2015, 5:48pm

Thanks, @jmorris,

So, for example, if one node in a cluster of 6 fails, then the entire app is halted until the failed node is fixed (assuming that updates are a necessary part of the app execution), and there is no hot replacement?

jmorris · July 23, 2015, 6:02pm

@itay -

No, the node will be put into a temporary down state and NodeUnavailable will be returned for any key mapped to it; all keys mapped to nodes other than the down node will be processed.
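One way an application might cope with writes to keys on the down node is to retry for a bounded time, so that the write can go through once the node comes back or is failed over. This is only a sketch assuming the 2.x .NET SDK; UpsertWithRetry, maxAttempts, and delay are hypothetical names and parameters, not part of the SDK.

```csharp
using System;
using System.Threading;
using Couchbase;
using Couchbase.Core;
using Couchbase.IO;

static class WriteRetrySketch
{
    // Hypothetical helper: retry an upsert while the key's node is unavailable.
    public static IOperationResult<T> UpsertWithRetry<T>(
        IBucket bucket, string key, T value, int maxAttempts, TimeSpan delay)
    {
        var result = bucket.Upsert(key, value);
        for (var attempt = 1;
             attempt < maxAttempts && result.Status == ResponseStatus.NodeUnavailable;
             attempt++)
        {
            // Wait before trying again; the node may recover or be failed over.
            Thread.Sleep(delay);
            result = bucket.Upsert(key, value);
        }
        // The caller must still check result.Success: the last attempt may have failed.
        return result;
    }
}
```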

-Jeff

itay · July 23, 2015, 6:57pm

So practically, having a multi-node cluster is important mainly for scaling but not for availability, since if one node fails and the app continues, data consistency will be jeopardized (due to partial writes).

Is there a plan to direct writes to replicas instead of to the failed node?

vmaleev · August 2, 2015, 10:33pm

Hello,
Same question from me. I saw the same behavior with .NET.
I tried to store IIS sessions in Couchbase and found that some sessions were broken when one node went down.
How quickly will a replica become the active copy in the cluster?

ingenthr · August 2, 2015, 11:10pm

You definitely have automatic high availability if you have 3 or more nodes and auto-failover enabled. Auto-failover is not on by default.
