.NET SDK takes a long time to recover from an outage

I’m trying to reproduce some problems that we have while performing Couchbase maintenance with both the Java and .NET clients. Basically, I set up a testing cluster - 3 nodes - no auto-failover.

I run a .NET application that reads from and writes to the cluster. If I shut down the cluster (all nodes) and then restart it, it can take about 5 minutes until all requests succeed again. In the meantime, I get a lot of timeouts.
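For reference, the repro is roughly the loop below. This is a minimal sketch rather than our actual app; the node address, bucket name, and key pattern are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Couchbase;
using Couchbase.Configuration.Client;

class ReproLoop
{
    static void Main()
    {
        // Placeholder bootstrap node; the real test points at the 3-node cluster.
        ClusterHelper.Initialize(new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://cb-node1:8091/") }
        });

        var bucket = ClusterHelper.GetBucket("default");
        var i = 0;

        while (true)
        {
            // One write and one read per second; log anything that does not succeed.
            var key = "repro-key-" + (i++ % 100);
            var write = bucket.Upsert(key, DateTime.UtcNow.ToString("o"));
            var read = bucket.Get<string>(key);

            if (!write.Success || !read.Success)
            {
                Console.WriteLine($"{DateTime.UtcNow:o} write={write.Status} read={read.Status}");
            }

            Thread.Sleep(1000);
        }
    }
}
```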

There is very little traffic on the cluster, and the cluster quickly reports all green once restarted.

This is a legacy app, so it must be using multiplexing by default. We’re on 2.4.7 - did we get flipped to multiplexing for free? :slight_smile:

If I restart the cluster and then quickly start the application from a stopped state, it works perfectly right away.

Hi @unhuman -

Have you tried the same test using 2.5.2 (I don’t recommend 2.5.0 or 2.5.1)? We improved fail-over scenarios significantly in 2.5.X, or at least the detection of them.

Yes, you are now using MUX - note that it's not pooled in 2.4.X, but it is pooled in 2.5.X.
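If you want to control how many of those pooled connections the client keeps per node, that should be governed by PoolConfiguration - a minimal sketch, with illustrative values rather than recommendations:

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

class PoolSizingExample
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://cb-node1:8091/") }, // placeholder node
            PoolConfiguration = new PoolConfiguration
            {
                MinSize = 2, // connections kept open per node
                MaxSize = 5  // upper bound per node under load
            }
        };

        using (var cluster = new Cluster(config))
        using (var bucket = cluster.OpenBucket("default"))
        {
            // use the bucket...
        }
    }
}
```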

Note that when you shut down a cluster without using failover, the SDK does not know that the cluster is down and keeps trying to connect using its current cluster map. At this point everything slows down as requests queue up and time out. The server itself has to go through a warm-up state while starting up, and the additional requests put more pressure on it. This would not be the suggested way to perform cluster maintenance.
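On the application side, the usual mitigation is to fail individual operations quickly and retry them with a backoff instead of letting them pile up. A rough sketch (the retry policy is entirely up to you; nothing here is mandated by the SDK):

```csharp
using System;
using System.Threading;
using Couchbase.Core;

static class RetryHelper
{
    // Hypothetical helper: retry a KV read a few times with exponential backoff
    // so queued-up timeouts don't amplify load while the cluster is unavailable.
    public static string GetWithRetry(IBucket bucket, string key, int maxAttempts = 5)
    {
        var delayMs = 100;
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var result = bucket.Get<string>(key);
            if (result.Success)
            {
                return result.Value;
            }

            Console.WriteLine($"Attempt {attempt} failed: {result.Status}");
            Thread.Sleep(delayMs);
            delayMs = Math.Min(delayMs * 2, 2000); // cap the backoff
        }
        return null;
    }
}
```

You can also tune ClientConfiguration.DefaultOperationLifespan if you want individual operations to give up sooner.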

-Jeff

Thanks @jmorris. I’ll play around with 2.5.2…

I’ll clarify that this is not a real-world scenario, but I’m experimenting with various ways to fail a cluster - we have services that become unresponsive during maintenance and require a restart; more so with Java, but we’ve seen it with .NET services as well. I do not think cluster warm-up is part of this problem, given that restarting the service immediately makes it work properly. Recovery just seems slow and inconsistent (things start working intermittently - probably some nodes reconnect properly before others). The gaps are significant.

When we perform maintenance, we use a process vetted by Couchbase support… We still have those issues. This makes upgrading Couchbase Server a challenge for us.

EDIT: using 2.5.2, the results are not much different in this contrived example. It still takes a long time for everything to resolve properly again.

-H