One Node Permanently Orange and Pending (Couchbase Enterprise 6.6.0, Operator 2.1.0)

I’m running into an issue with a 3-node production Couchbase cluster running on Azure: after a series of automatic failovers, the cluster never comes back to a healthy state. The Servers page of the Couchbase UI shows one node permanently “Orange”.

All buckets are configured with replicas = 1. Some buckets still show as green, while the affected ones show as “Pending” (orange) with the message “1 node pending”.
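
For context, assuming the buckets are operator-managed, one replica per bucket looks roughly like the sketch below (the bucket name and memory quota are placeholders, not my real definitions):

apiVersion: couchbase.com/v2
kind: CouchbaseBucket
metadata:
  name: example-bucket     # placeholder name
spec:
  memoryQuota: 256Mi       # placeholder quota
  replicas: 1              # one replica copy, so a single down node leaves the bucket "Pending"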

Kubernetes settings (Operator 2.1.0, Helm chart):

autoFailoverTimeout: 120s
autoFailoverMaxCount: 3
autoFailoverOnDataDiskIssues: true
autoFailoverOnDataDiskIssuesTimePeriod: 120s
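
For reference, these settings land in the CouchbaseCluster spec roughly as below (the Helm chart nests them under its own cluster values key, so treat the exact nesting as approximate):

apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: oaf-couchbase
spec:
  cluster:
    autoFailoverTimeout: 120s                     # wait 2 minutes before failing over an unresponsive node
    autoFailoverMaxCount: 3                       # at most 3 automatic failovers before the quota must be reset
    autoFailoverOnDataDiskIssues: true
    autoFailoverOnDataDiskIssuesTimePeriod: 120s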

On the Kubernetes side, the PVCs are set to “Retain” (a sketch of what that storage setup looks like follows the log excerpt below). I believe pods are being killed and automatically recreated: the pods named like “oaf-couchbase-0000” show 0 restarts and ages of less than 1 day (manually killing a pod and letting it come back also reproduces the error). The Couchbase Operator pod is 6 weeks old (as expected) but shows 12 restarts (which is not expected). The operator logs in Kubernetes show a repetitive attempt to rebalance:

{"level":"info","ts":1615804307.2085128,"logger":"cluster","msg":"Reconcile completed","cluster":"couchbase/oaf-couchbase"}
{"level":"info","ts":1615804308.2641103,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t… // 80 identical lines\n \t status: \"True\",\n \t" type: Available",\n- \t- lastTransitionTime: \"2021-03-15T10:31:46Z\",\n- \t lastUpdateTime: \"2021-03-15T10:31:46Z\",\n- \t" message: The operator is attempting to rebalance the data to correct this issue",\n- \t" reason: Unbalanced",\n- \t status: \"False\",\n+ \t- lastTransitionTime: \"2021-03-15T10:31:48Z\",\n+ \t lastUpdateTime: \"2021-03-15T10:31:48Z\",\n+ \t" message: Data is equally distributed across all nodes in the cluster",\n+ \t" reason: Balanced",\n+ \t status: \"True\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t… // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804308.8199947,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t… // 82 identical lines\n \t- lastTransitionTime: \"2021-03-15T10:31:48Z\",\n \t lastUpdateTime: \"2021-03-15T10:31:48Z\",\n- \t" message: Data is equally distributed across all nodes in the cluster",\n- \t" reason: Balanced",\n- \t status: \"True\",\n+ \t" message: The operator is attempting to rebalance the data to correct this issue",\n+ \t" reason: Unbalanced",\n+ \t status: \"False\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t… // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804309.0740209,"logger":"cluster","msg":"Cluster status","cluster":"couchbase/oaf-couchbase","balance":"unbalanced","rebalancing":false}
{"level":"info","ts":1615804309.074079,"logger":"cluster","msg":"Node status","cluster":"couchbase/oaf-couchbase","name":"oaf-couchbase-0000","version":"6.6.0","class":"default","managed":true,"status":"Warmup"}
{"level":"info","ts":1615804309.074094,"logger":"cluster","msg":"Node status","cluster":"couchbase/oaf-couchbase","name":"oaf-couchbase-0001","version":"6.6.0","class":"default","managed":true,"status":"Active"}
{"level":"info","ts":1615804309.0741038,"logger":"cluster","msg":"Node status","cluster":"couchbase/oaf-couchbase","name":"oaf-couchbase-0002","version":"6.6.0","class":"default","managed":true,"status":"Active"}
{"level":"info","ts":1615804309.4087725,"logger":"cluster","msg":"Pods warming up, skipping","cluster":"couchbase/oaf-couchbase"}
{"level":"info","ts":1615804309.6084857,"logger":"cluster","msg":"Reconcile completed","cluster":"couchbase/oaf-couchbase"}
{"level":"info","ts":1615804310.4462156,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t… // 80 identical lines\n \t status: \"True\",\n \t" type: Available",\n- \t- lastTransitionTime: \"2021-03-15T10:31:48Z\",\n- \t lastUpdateTime: \"2021-03-15T10:31:48Z\",\n- \t" message: The operator is attempting to rebalance the data to correct this issue",\n- \t" reason: Unbalanced",\n- \t status: \"False\",\n+ \t- lastTransitionTime: \"2021-03-15T10:31:50Z\",\n+ \t lastUpdateTime: \"2021-03-15T10:31:50Z\",\n+ \t" message: Data is equally distributed across all nodes in the cluster",\n+ \t" reason: Balanced",\n+ \t status: \"True\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t… // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804311.2113588,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t… // 80 identical lines\n \t status: \"True\",\n \t" type: Available",\n- \t- lastTransitionTime: \"2021-03-15T10:31:50Z\",\n- \t lastUpdateTime: \"2021-03-15T10:31:50Z\",\n- \t" message: Data is equally distributed across all nodes in the cluster",\n- \t" reason: Balanced",\n- \t status: \"True\",\n+ \t- lastTransitionTime: \"2021-03-15T10:31:51Z\",\n+ \t lastUpdateTime: \"2021-03-15T10:31:51Z\",\n+ \t" message: The operator is attempting to rebalance the data to correct this issue",\n+ \t" reason: Unbalanced",\n+ \t status: \"False\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t… // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804311.4906135,"logger":"cluster","msg":"Cluster status","cluster":"couchbase/oaf-couchbase","balance":"unbalanced","rebalancing":false}
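
Coming back to the storage setup mentioned above: as I understand it, “Retain” means the storage class backing the operator’s volume claim template keeps the underlying Azure disk when a PVC is released, so recreated pods reattach their existing data. A rough sketch, with the storage class name and sizes as placeholders:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-retain        # placeholder name
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Retain                 # keep the disk (and its data) if the claim is deleted
---
# Corresponding fragment of the CouchbaseCluster spec: the claim template backing the
# "couchbase-volume" mount referenced in the server config further down.
spec:
  volumeClaimTemplates:
    - metadata:
        name: couchbase-volume
      spec:
        storageClassName: managed-premium-retain
        resources:
          requests:
            storage: 100Gi            # placeholder size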

However, no UI interactions or pod deletions seem to get the Orange node back to Green. What I’ve tried:

Rebalancing: Pressing the rebalance button gives the error message:

Rebalance interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0000.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = b2cc3ae1f4419ff22c76d48d05251029

Failover: Attempting to run a hard failover through the UI gives the error message:
“failover interrupted due to auto-failover of nodes”

Remove the Server: Removing the Server gives the message:
“Node flagged for removal | still taking traffic | REMOVAL pending rebalance”

However, because the rebalance can’t happen, the removal never finishes.

There have been multiple node failures in a short amount of time. I can’t say why (maybe an Azure infrastructure issue), but all 3 Couchbase pods were killed within 4 hours of each other. The first time this happened, the cluster eventually recovered: I had to press “Reset Auto-Failover Quota”, after which all 3 nodes returned to Green status, but the cluster was unbalanced - node 0002 had almost no data and node 0000 was just coming out of the Orange state.

During the rebalance that was running right afterward, node 0001 went offline; it’s been 4 hours and it’s still in the “Orange” state. These are the logs from that time:

Rebalance interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = 7fc24b915d4e6e1f506a4f21c3db78b0

Could not automatically fail over nodes (['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc']). Would lose vbuckets in the following buckets: ["fs-bucket-v0","default","rostercache"]

Starting graceful failover of nodes ['ns_1@oaf-couchbase-0002.oaf-couchbase.couchbase.svc']. Operation Id = 5c639a6b83ddc25bd37ab7fd25196fd2

Graceful failover interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = 5c639a6b83ddc25bd37ab7fd25196fd2

Starting rebalance, KeepNodes = ['ns_1@oaf-couchbase-0000.oaf-couchbase.couchbase.svc',
'ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc',
'ns_1@oaf-couchbase-0002.oaf-couchbase.couchbase.svc'], EjectNodes = , Failed over and being ejected nodes = ; no delta recovery nodes; Operation Id = 80dbea24ad2dcd9c358cd83ce89473ff

Rebalance interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = 80dbea24ad2dcd9c358cd83ce89473ff

I also see several hundred of these errors throughout the day:

Service 'cbas' exited with status 1. Restarting. Messages:
2021-03-15T15:14:33.604+00:00 CRIT CBAS.cbas my node is in topology but not registered in cbauth db
2021-03-15T15:14:33.604+00:00 WARN CBAS.cbas driver bootstrap attempt failed: my node is in topology but not registered in cbauth db
2021-03-15T15:14:33.604+00:00 INFO CBAS.cbas driver bootstrap failed; will retry in 1s (2 retries remaining)
2021-03-15T15:14:34.604+00:00 CRIT CBAS.cbas my node is in topology but not registered in cbauth db
2021-03-15T15:14:34.604+00:00 WARN CBAS.cbas driver bootstrap attempt failed: my node is in topology but not registered in cbauth db
2021-03-15T15:14:34.604+00:00 INFO CBAS.cbas driver bootstrap failed; will retry in 1s (1 retries remaining)
2021-03-15T15:14:35.604+00:00 CRIT CBAS.cbas my node is in topology but not registered in cbauth db
2021-03-15T15:14:35.605+00:00 WARN CBAS.cbas driver bootstrap attempt failed: my node is in topology but not registered in cbauth db
2021-03-15T15:14:35.605+00:00 FATA CBAS.cbas unable to complete driver bootstrap after 60 attempt(s) (last failure: my node is in topology but not registered in cbauth db)

This is running a production service, so I’d prefer not to tear down and recreate the whole cluster if I can avoid it. Any ideas on how to recover from this situation?

Update: after 8 hours, this issue fixed itself without leaving any obvious log messages in Kubernetes or Couchbase (a node removal completed successfully; otherwise I made no changes). I still don’t know why the original failures happened or how to prevent the issue in the future - any suggestions?

Update #2, in case anyone else runs into the same problem: the nodes were stuck in the warmup state for an excessively long time because I had a lot of view indexes set up on them; deleting these view indexes substantially reduced the warmup time. The root cause of the problem was memory contention on Kubernetes: in my original Helm chart I had failed to specify resource limits, so my pods were being evicted frequently. Adding resource limits like these:

servers:
  # Name for the server configuration. It must be unique.
  default:
    # Size of the couchbase cluster.
    size: 3
    # The services to run on nodes
    services:
      - data
      - index
      - query
      - search
      - eventing
      - analytics
    volumeMounts:
      default: couchbase-volume
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
      requests:
        cpu: "1"
        memory: 14Gi

fixed the issue. After reducing RAM to a bare minimum and doing some rebalances, the new pods seem much more stable.
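
For anyone tuning this afterwards: by “reducing RAM to a bare minimum” I mean shrinking the per-service memory quotas in the cluster spec so that quotas plus overhead comfortably fit inside the container memory request/limit above. The numbers below are illustrative, not my exact values:

spec:
  cluster:
    dataServiceMemoryQuota: 4Gi        # illustrative; size to your working set
    indexServiceMemoryQuota: 1Gi
    searchServiceMemoryQuota: 1Gi
    eventingServiceMemoryQuota: 1Gi
    analyticsServiceMemoryQuota: 1Gi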