.NET SDK takes a long time to recover from an outage

I’m trying to reproduce some problems that we have while performing Couchbase maintenance with both the Java and .NET clients. Basically, I set up a testing cluster - 3 nodes - no auto-failover.

I run a .NET application that reads from and writes to the cluster. If I shut down the cluster (all nodes) and then restart it, it can take about 5 minutes until all requests succeed again. In the meantime, I get a lot of timeouts.
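For reference, the repro is roughly the loop below. This is a minimal sketch rather than our actual app; the node address, bucket name, and key pattern are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Couchbase;
using Couchbase.Configuration.Client;

class ReproLoop
{
    static void Main()
    {
        // Placeholder bootstrap node; the real test points at the 3-node cluster.
        ClusterHelper.Initialize(new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://cb-node1:8091/") }
        });

        var bucket = ClusterHelper.GetBucket("default");
        var i = 0;

        while (true)
        {
            // One write and one read per second; log anything that does not succeed.
            var key = "repro-key-" + (i++ % 100);
            var write = bucket.Upsert(key, DateTime.UtcNow.ToString("o"));
            var read = bucket.Get<string>(key);

            if (!write.Success || !read.Success)
            {
                Console.WriteLine($"{DateTime.UtcNow:o} write={write.Status} read={read.Status}");
            }

            Thread.Sleep(1000);
        }
    }
}
```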

There is very little traffic on the cluster, and the cluster quickly reports all green once restarted.

This is a legacy app, so it must be using multiplexing by default. We’re on 2.4.7 - did we get flipped to multiplexing for free? :slight_smile:

If I restart the cluster and then quickly start the application from a stopped state, it works perfectly right away.

Hi @unhuman -

Have you tried the same test using 2.5.2 (I don’t recommend 2.5.0 or 2.5.1)? We improved fail-over scenarios significantly in 2.5.X, or at least the detection of them.

Yes, you are now using MUX - note that it's not pooled in 2.4.X, but it is pooled in 2.5.X.
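If you want to control how many of those pooled connections the client keeps per node, that should be governed by PoolConfiguration - a minimal sketch, with illustrative values rather than recommendations:

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

class PoolSizingExample
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://cb-node1:8091/") }, // placeholder node
            PoolConfiguration = new PoolConfiguration
            {
                MinSize = 2, // connections kept open per node
                MaxSize = 5  // upper bound per node under load
            }
        };

        using (var cluster = new Cluster(config))
        using (var bucket = cluster.OpenBucket("default"))
        {
            // use the bucket...
        }
    }
}
```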

Note that when you shut down a cluster without using failover, the SDK does not know that the cluster is down and keeps trying to connect using its current cluster map. At this point everything slows down as requests queue up and time out. The server itself has to go through a warm-up state while starting up, and the additional requests put more pressure on it. This would not be the suggested way to perform cluster maintenance.
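On the application side, the usual mitigation is to fail individual operations quickly and retry them with a backoff instead of letting them pile up. A rough sketch (the retry policy is entirely up to you; nothing here is mandated by the SDK):

```csharp
using System;
using System.Threading;
using Couchbase.Core;

static class RetryHelper
{
    // Hypothetical helper: retry a KV read a few times with exponential backoff
    // so queued-up timeouts don't amplify load while the cluster is unavailable.
    public static string GetWithRetry(IBucket bucket, string key, int maxAttempts = 5)
    {
        var delayMs = 100;
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var result = bucket.Get<string>(key);
            if (result.Success)
            {
                return result.Value;
            }

            Console.WriteLine($"Attempt {attempt} failed: {result.Status}");
            Thread.Sleep(delayMs);
            delayMs = Math.Min(delayMs * 2, 2000); // cap the backoff
        }
        return null;
    }
}
```

You can also tune ClientConfiguration.DefaultOperationLifespan if you want individual operations to give up sooner.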

-Jeff

Thanks @jmorris. I’ll play around with 2.5.2…

I’ll clarify that this is not a real-world scenario, but I’m experimenting with various ways to fail a cluster - we have services that become unresponsive during maintenance and require a restart; more so with Java, but we’ve seen it with .NET services as well. I do not think cluster warm-up is part of this problem, given that restarting the service immediately makes it work properly. Recovery just seems slow and inconsistent (things start working intermittently - probably some nodes reconnect properly before others). The gaps are significant.

When we perform maintenance, we use a process vetted by Couchbase support… We still have those issues. This makes upgrading Couchbase Server a challenge for us.

EDIT: using 2.5.2, the results are not much different in this contrived example. It still takes a long time for everything to resolve properly again.

-H