.NET SDK loses connectivity after redeploying the cluster

We have a Couchbase cluster deployed to Kubernetes as a StatefulSet, and we connect to it from an ASP.NET Core application, using the .NET SDK.

When we initialize the SDK, we use a connection string like this:

couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local,couchbase-pci-1.couchbase-pci.couchbase.svc.cluster.local

Where the address couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local identifies a specific pod of the StatefulSet.

The problem we are encountering is that when we redeploy the nodes of the cluster (or the pods get recreated for some other reason), our application loses connectivity, and all Couchbase operations start failing with timeouts.
(We redeploy the cluster by running couchbase-cli failover for a node, then creating a new node and adding it to the cluster with couchbase-cli server-add, then running couchbase-cli rebalance, and repeating this for each node.)

This is what the pods of the StatefulSet look like before redeploying:

$ kubectl get po -n couchbase -o wide
NAME              READY   STATUS    RESTARTS   AGE    IP          NODE
couchbase-pci-0   2/2     Running   0          106m   10.0.0.25   gke-stg-pci-default-n1-standard-8-38215054-ix88
couchbase-pci-1   2/2     Running   0          108m   10.0.6.65   gke-stg-pci-default-n1-standard-8-b137cf23-ouat

Once we redeploy the cluster, the pods are recreated, and they get new IP addresses, as expected:

$ kubectl get po -n couchbase -o wide
NAME              READY   STATUS    RESTARTS   AGE     IP          NODE                                           
couchbase-pci-0   2/2     Running   0          2m14s   10.0.0.26   gke-stg-pci-default-n1-standard-8-38215054-ix88
couchbase-pci-1   2/2     Running   0          4m26s   10.0.6.67   gke-stg-pci-default-n1-standard-8-b137cf23-ouat

After this happens, all Couchbase operations in our application start failing with timeout errors, and the application never recovers from this state: it keeps failing until it’s redeployed (after which it starts working normally again).

In my logs I see this exception surfaced by the Couchbase SDK:

The Couchbase SDK returned an error. - [Couchbase.Core.Exceptions.UnambiguousTimeoutException]: The operation /6819 timed out after 00:00:02.5000000.

And in the SDK logs I see messages like this:

Issue getting Cluster Map on server 10.0.0.25:11210!
Error replacing dead connections for 10.0.6.65:11210.
Error replacing dead connections for 10.0.0.25:11210.
Issue getting Cluster Map on server 10.0.6.65:11210!
Error replacing dead connections for 10.0.6.65:11210.
...

This suggests that the SDK is still trying to connect to the old IP addresses, and is not discovering the new nodes.

In the admin UI I see the nodes on the Servers screen listed under the hostnames from the connection string (screenshot omitted).

I assume this means the nodes in the cluster are correctly configured to use these hostnames, and not just IP addresses. So I would expect the SDK to find the new IP addresses of the nodes by resolving these hostnames again; is that correct?
But based on the errors, it seems that the SDK stays stuck with the old IP addresses forever.

Could anyone advise what might be wrong, or how this can be fixed? Is something wrong with the cluster config, the Kubernetes setup, or with the way we’re using the SDK?

@markvincze What version of the SDK are you using?

@btburnett3 We are using version 3.2.0.

@markvincze -

Can you share your Couchbase client configuration?

Jeff

@jmorris,

I’m simplifying things a bit because we’re using Couchbase via an internal wrapper caching library, but we basically do the following, registering the client with the AddCouchbase extension method. (I’ve omitted the part where we read the values from appsettings.json.)

services.AddCouchbase(opt =>
    {
        var enableTls = false;
        var connectionString = "couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local,couchbase-pci-1.couchbase-pci.couchbase.svc.cluster.local";
        var userName = "foo";
        var password = "bar";

        opt.EnableTls = enableTls;
        opt.WithConnectionString(connectionString);
        opt.WithCredentials(userName, password);
    });

As far as I can see, these are the only options we are customizing on the client config; everything else is the default.

And then the actual usage happens by injecting IBucketProvider, and accessing the bucket with .GetBucketAsync("baz").
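
For reference, the consuming code looks roughly like this (a simplified sketch rather than our actual wrapper; the class name and the document key are just placeholders, the bucket name is the real one):

using System.Threading.Tasks;
using Couchbase.Extensions.DependencyInjection;
using Couchbase.KeyValue;

public class CacheReader
{
    private readonly IBucketProvider _bucketProvider;

    public CacheReader(IBucketProvider bucketProvider) => _bucketProvider = bucketProvider;

    public async Task<string> GetValueAsync(string key)
    {
        // Opens (or returns the already opened) bucket registered with AddCouchbase.
        var bucket = await _bucketProvider.GetBucketAsync("baz");
        var collection = bucket.DefaultCollection();

        // This is the kind of operation that starts failing with UnambiguousTimeoutException after the redeploy.
        var result = await collection.GetAsync(key);
        return result.ContentAs<string>();
    }
}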

Is this the info you were interested in?

One note on this: with the Couchbase Autonomous Operator, we specifically decided against using StatefulSets; instead, with the Operator we keep the identity of the node in a Persistent Volume.

As you probably know, Kubernetes and DNS management are pretty closely related. I don’t want to go so far as to say that a StatefulSet can’t work; that said, at least for now Couchbase isn’t pursuing or testing that method of running Couchbase on K8S. You might want to look into running our Operator if you want to run under K8S.

Then beyond that, there is an interaction between how Kubernetes CoreDNS works and how .NET does name resolution.

I’ll defer to others (I’m not an expert in this area), but I ran into an issue with DNS resolution under Kubernetes with .NET a few months ago. I know I discussed it with @jmorris at the time, but I don’t remember how we addressed it.

At the time, I added a comment to the DnsClient.NET project. That’s a .NET DNS client, now supporting .NET Core, that contacts nameservers directly; on Linux it does not use system-level libraries or syscalls, and it doesn’t honor resolv.conf. As a result, IIRC, I found that it wouldn’t honor the search domains specified in that file.

It looks like what you are passing in is fully qualified, but if I recall, one of the issues was that the TTL needs to be much shorter. With K8S DNS, you can reduce DNS caching if IPs are going to resolve differently for a hostname after a failure. Unfortunately, DnsClient.NET doesn’t pick up the resolv.conf DNS settings, which is important to how you adjust this on K8S.
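
If it helps, something like this rough diagnostic (not anything the SDK does internally; the hostname is just the first one from your connection string) would show whether the OS resolver and DnsClient.NET disagree after a pod restart:

using System;
using System.Net;
using System.Threading.Tasks;
using DnsClient;
using DnsClient.Protocol;

class DnsCheck
{
    static async Task Main()
    {
        var host = "couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local";

        // Path 1: the OS resolver, which goes through resolv.conf / CoreDNS the way Kubernetes expects.
        foreach (var address in await Dns.GetHostAddressesAsync(host))
        {
            Console.WriteLine($"OS resolver: {address}");
        }

        // Path 2: DnsClient.NET, which sends DNS queries itself instead of going through the OS resolver.
        var lookup = new LookupClient();
        var response = await lookup.QueryAsync(host, QueryType.A);
        foreach (var record in response.Answers.ARecords())
        {
            Console.WriteLine($"DnsClient:   {record.Address} (TTL {record.TimeToLive})");
        }
    }
}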

I don’t know if this helps that much, but it may give you a pointer or two on how to debug further. It might be that the default TTL is too high: while the hostname does resolve after the pod is restarted, it resolves to a different IP, and it’d take some time before the cache expires and the .NET library actually asks Kubernetes CoreDNS again.

To help debug this at the time, I tossed together a demonstration of an Operator-based cluster, including a .NET Core app, to make some changes and do some debugging. @jmorris might recall if we made any changes.

Hope that helps…

I’d just like to throw in my two cents, for what it’s worth:

  1. The Autonomous Operator is awesome and about 100x better than StatefulSets, in my experience. We switched a couple of years ago and have never looked back. And it’s only gotten better since then.

  2. The downsides of the DnsClient implementation used by the SDK are accurate, and are primarily a result of the limited DNS support baked into .NET, especially the lack of an API that allows us to resolve SRV records. That said, we could probably switch to using OS-level name resolution for resolving IP addresses and only use DnsClient for SRV records (rough sketch of the idea below). This would still, however, leave us with issues around how we cache IPs internally within the SDK; we’d need to address that as well.
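
To illustrate the split I mean (just a sketch of the idea, not how the SDK is implemented today; the SRV name is hypothetical, modeled on the service name above): DnsClient would only be used for the SRV lookup, and the resulting targets would then go through the OS resolver, which honors resolv.conf and CoreDNS TTLs.

using System;
using System.Net;
using System.Threading.Tasks;
using DnsClient;
using DnsClient.Protocol;

class SrvBootstrapSketch
{
    static async Task Main()
    {
        // Hypothetical SRV name for a non-TLS bootstrap; "_couchbases._tcp." would be the TLS variant.
        var srvName = "_couchbase._tcp.couchbase-pci.couchbase.svc.cluster.local";

        // DnsClient is only used for the SRV lookup, since .NET has no built-in SRV API.
        var lookup = new LookupClient();
        var response = await lookup.QueryAsync(srvName, QueryType.SRV);

        foreach (var srv in response.Answers.SrvRecords())
        {
            // The target hostnames are resolved via the OS resolver instead of DnsClient,
            // so resolv.conf and CoreDNS caching behave the way Kubernetes expects.
            foreach (var address in await Dns.GetHostAddressesAsync(srv.Target.Value))
            {
                Console.WriteLine($"{srv.Target} -> {address}:{srv.Port}");
            }
        }
    }
}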


Hey @ingenthr @btburnett3,

Thanks for all the details and advice!
I’ll take this back to our team to discuss and decide on the next steps.