We have a Couchbase cluster deployed to Kubernetes as a StatefulSet
, and we connect to it from an ASP.NET Core application, using the .NET SDK.
When we initialize the SDK, we use a connection string like this:
couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local,couchbase-pci-1.couchbase-pci.couchbase.svc.cluster.local
Where the address couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local
identifies a specific pod of the StatefulSet.
The problem we are encountering is that when we redeploy the nodes of the cluster (or the pods get recreated for some other reason), our application loses connectivity, and all Couchbase operations start failing with timeouts.
(Redeploying the cluster happens by running couchbase-cli failover
for a node, then creating a new node and adding it to the cluster with couchbase-cli server-add
, then doing couchbase-cli rebalance
, and repeating this for each node.)
This is what the pods of the StatefulSet look like before redeploying:
$ kubectl get po -n couchbase -o wide
NAME READY STATUS RESTARTS AGE IP NODE
couchbase-pci-0 2/2 Running 0 106m 10.0.0.25 gke-stg-pci-default-n1-standard-8-38215054-ix88
couchbase-pci-1 2/2 Running 0 108m 10.0.6.65 gke-stg-pci-default-n1-standard-8-b137cf23-ouat
Once we redeploy the cluster, the pods are recreated, and they get new IP addresses, as expected:
$ kubectl get po -n couchbase -o wide
NAME READY STATUS RESTARTS AGE IP NODE
couchbase-pci-0 2/2 Running 0 2m14s 10.0.0.26 gke-stg-pci-default-n1-standard-8-38215054-ix88
couchbase-pci-1 2/2 Running 0 4m26s 10.0.6.67 gke-stg-pci-default-n1-standard-8-b137cf23-ouat
After this happens, all Couchbase operations in our application start failing with timeout errors, and the application never recovers from this state, it keeps failing until it’s redeployed (after which it starts working normally again).
In my logs I see this exception surfaced by the Couchbase SDK:
The Couchbase SDK returned an error. - [Couchbase.Core.Exceptions.UnambiguousTimeoutException]: The operation /6819 timed out after 00:00:02.5000000.
And in the SDK logs I see messages like this:
Issue getting Cluster Map on server 10.0.0.25:11210!
Error replacing dead connections for 10.0.6.65:11210.
Error replacing dead connections for 10.0.0.25:11210.
Issue getting Cluster Map on server 10.0.6.65:11210!
Error replacing dead connections for 10.0.6.65:11210.
...
This suggests that the SDK is still trying to connect to the old IP addresses, and is not discovering the new nodes.
The connection string I’m using is the following:
couchbase-pci-0.couchbase-pci.couchbase.svc.cluster.local,couchbase-pci-1.couchbase-pci.couchbase.svc.cluster.local
Where the two hostnames are supposed to identify the two individual pods in the StatefulSet
.
In the admin UI I see this on the Servers screen:
I assume this means that the nodes in the cluster are correctly configured to use these hostnames, and not just IP addresses. So I would expect the SDK to find out the new IP address of the nodes by resolving these hostnames again, is that correct?
But based on the errors, it seems that the SDK stays stuck with the old IP addresses forever.
Could anyone advise what might be wrong, or how this can be fixed? Is something wrong with the cluster config, the Kubernetes setup, or with the way we’re using the SDK?