I’m running into an issue with a 3-node production Couchbase cluster running on Azure: after a series of automatic failovers, the cluster never comes back to a “working” status. The Servers page of the Couchbase UI shows the affected node as permanently “Orange”.
All buckets are configured with replicas = 1, so affected buckets show as “Pending”. Some are green, some are orange with the message “1 node pending”.
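For context, the bucket definitions aren’t shown here; assuming they are managed by the Operator as CouchbaseBucket resources, each one is declared with a single replica along these lines (a rough sketch only - the bucket name is taken from the logs further down, and the memory quota is a placeholder):

# Illustrative sketch, assuming Operator-managed buckets; quota is a placeholder.
apiVersion: couchbase.com/v2
kind: CouchbaseBucket
metadata:
  name: rostercache
  namespace: couchbase
spec:
  memoryQuota: 256Mi
  replicas: 1   # one replica copy per bucket, as described above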
Kubernetes settings (Operator 2.1.0, helm chart):
autoFailoverTimeout: 120s
autoFailoverMaxCount: 3
autoFailoverOnDataDiskIssues: true
autoFailoverOnDataDiskIssuesTimePeriod: 120s
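For reference, this is roughly where those settings sit in the CouchbaseCluster resource that the Helm chart renders. Only the autofailover values are the ones listed above; the image, server class, and service list are inferred from the logs and purely illustrative:

# Sketch of the relevant part of the CouchbaseCluster spec (Operator 2.x CRD field names).
# Only the autofailover values come from our settings; the rest is inferred/illustrative.
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: oaf-couchbase
  namespace: couchbase
spec:
  image: couchbase/server:6.6.0
  cluster:
    autoFailoverTimeout: 120s
    autoFailoverMaxCount: 3
    autoFailoverOnDataDiskIssues: true
    autoFailoverOnDataDiskIssuesTimePeriod: 120s
  servers:
  - name: default
    size: 3
    services:
    - data
    - analytics   # the cbas errors quoted below come from this service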
On the Kubernetes side, we have the PVCs set to “Retain”. I think the pods are being killed and automatically recreated: the pods named like “oaf-couchbase-0000” show 0 restarts and ages of less than a day (and manually killing a pod and letting it come back reproduces the error). The Couchbase Operator pod is 6 weeks old (as expected) but shows 12 restarts (which is not expected). The Kubernetes logs show a repeated attempt to rebalance:
{"level":"info","ts":1615804307.2085128,"logger":"cluster","msg":"Reconcile completed","cluster":"couchbase/oaf-couchbase"}
{"level":"info","ts":1615804308.2641103,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t... // 80 identical lines\n \t status: \"True\",\n \t" type: Available",\n- \t- lastTransitionTime: \"2021-03-15T10:31:46Z\",\n- \t lastUpdateTime: \"2021-03-15T10:31:46Z\",\n- \t" message: The operator is attempting to rebalance the data to correct this issue",\n- \t" reason: Unbalanced",\n- \t status: \"False\",\n+ \t- lastTransitionTime: \"2021-03-15T10:31:48Z\",\n+ \t lastUpdateTime: \"2021-03-15T10:31:48Z\",\n+ \t" message: Data is equally distributed across all nodes in the cluster",\n+ \t" reason: Balanced",\n+ \t status: \"True\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t... // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804308.8199947,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t... // 82 identical lines\n \t- lastTransitionTime: \"2021-03-15T10:31:48Z\",\n \t lastUpdateTime: \"2021-03-15T10:31:48Z\",\n- \t" message: Data is equally distributed across all nodes in the cluster",\n- \t" reason: Balanced",\n- \t status: \"True\",\n+ \t" message: The operator is attempting to rebalance the data to correct this issue",\n+ \t" reason: Unbalanced",\n+ \t status: \"False\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t... // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804309.0740209,"logger":"cluster","msg":"Cluster status","cluster":"couchbase/oaf-couchbase","balance":"unbalanced","rebalancing":false}
{"level":"info","ts":1615804309.074079,"logger":"cluster","msg":"Node status","cluster":"couchbase/oaf-couchbase","name":"oaf-couchbase-0000","version":"6.6.0","class":"default","managed":true,"status":"Warmup"}
{"level":"info","ts":1615804309.074094,"logger":"cluster","msg":"Node status","cluster":"couchbase/oaf-couchbase","name":"oaf-couchbase-0001","version":"6.6.0","class":"default","managed":true,"status":"Active"}
{"level":"info","ts":1615804309.0741038,"logger":"cluster","msg":"Node status","cluster":"couchbase/oaf-couchbase","name":"oaf-couchbase-0002","version":"6.6.0","class":"default","managed":true,"status":"Active"}
{"level":"info","ts":1615804309.4087725,"logger":"cluster","msg":"Pods warming up, skipping","cluster":"couchbase/oaf-couchbase"}
{"level":"info","ts":1615804309.6084857,"logger":"cluster","msg":"Reconcile completed","cluster":"couchbase/oaf-couchbase"}
{"level":"info","ts":1615804310.4462156,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t... // 80 identical lines\n \t status: \"True\",\n \t" type: Available",\n- \t- lastTransitionTime: \"2021-03-15T10:31:48Z\",\n- \t lastUpdateTime: \"2021-03-15T10:31:48Z\",\n- \t" message: The operator is attempting to rebalance the data to correct this issue",\n- \t" reason: Unbalanced",\n- \t status: \"False\",\n+ \t- lastTransitionTime: \"2021-03-15T10:31:50Z\",\n+ \t lastUpdateTime: \"2021-03-15T10:31:50Z\",\n+ \t" message: Data is equally distributed across all nodes in the cluster",\n+ \t" reason: Balanced",\n+ \t status: \"True\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t... // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804311.2113588,"logger":"cluster","msg":"Resource updated","cluster":"couchbase/oaf-couchbase","diff":" strings.Join({\n \t... // 80 identical lines\n \t status: \"True\",\n \t" type: Available",\n- \t- lastTransitionTime: \"2021-03-15T10:31:50Z\",\n- \t lastUpdateTime: \"2021-03-15T10:31:50Z\",\n- \t" message: Data is equally distributed across all nodes in the cluster",\n- \t" reason: Balanced",\n- \t status: \"True\",\n+ \t- lastTransitionTime: \"2021-03-15T10:31:51Z\",\n+ \t lastUpdateTime: \"2021-03-15T10:31:51Z\",\n+ \t" message: The operator is attempting to rebalance the data to correct this issue",\n+ \t" reason: Unbalanced",\n+ \t status: \"False\",\n \t" type: Balanced",\n \t"currentVersion: 6.6.0",\n \t... // 18 identical lines\n }, "\n")\n"}
{"level":"info","ts":1615804311.4906135,"logger":"cluster","msg":"Cluster status","cluster":"couchbase/oaf-couchbase","balance":"unbalanced","rebalancing":false}
However, no UI interactions or pod deletions seem to get the Orange node back to Green. What I’ve tried:
Rebalancing: Pressing the rebalance button gives the error message:
Rebalance interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0000.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = b2cc3ae1f4419ff22c76d48d05251029
Failover: Attempting to run a hard failover through the UI gives the error message:
“failover interrupted due to auto-failover of nodes”
Remove the Server: Removing the Server gives the message:
“Node flagged for removal | still taking traffic | REMOVAL pending rebalance”
However, because the rebalance can’t happen, the removal never finishes.
There have been multiple node failures in a short amount of time; I can't say why (maybe an internal Azure event), but all 3 Couchbase pods were killed within 4 hours of each other. The first time this happened, the cluster eventually recovered: I had to press "Reset Auto-Failover Quota", after which all 3 nodes returned to Green status, but the cluster was left unbalanced - node 0002 had almost no data, and node 0000 was just coming out of the Orange state.
During the rebalance that was running right afterward, node 0001 went offline; it has been 4 hours and it is still in the "Orange" state. These are the logs from that time:
Rebalance interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = 7fc24b915d4e6e1f506a4f21c3db78b0
Could not automatically fail over nodes (['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc']). Would lose vbuckets in the following buckets: ["fs-bucket-v0","default","rostercache"]
Starting graceful failover of nodes ['ns_1@oaf-couchbase-0002.oaf-couchbase.couchbase.svc']. Operation Id = 5c639a6b83ddc25bd37ab7fd25196fd2
Graceful failover interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = 5c639a6b83ddc25bd37ab7fd25196fd2
Starting rebalance, KeepNodes = ['ns_1@oaf-couchbase-0000.oaf-couchbase.couchbase.svc', 'ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc', 'ns_1@oaf-couchbase-0002.oaf-couchbase.couchbase.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 80dbea24ad2dcd9c358cd83ce89473ff
Rebalance interrupted due to auto-failover of nodes ['ns_1@oaf-couchbase-0001.oaf-couchbase.couchbase.svc'].
Rebalance Operation Id = 80dbea24ad2dcd9c358cd83ce89473ff
I also see several hundred of these errors throughout the day:
Service 'cbas' exited with status 1. Restarting. Messages:
2021-03-15T15:14:33.604+00:00 CRIT CBAS.cbas my node is in topology but not registered in cbauth db
2021-03-15T15:14:33.604+00:00 WARN CBAS.cbas driver bootstrap attempt failed: my node is in topology but not registered in cbauth db
2021-03-15T15:14:33.604+00:00 INFO CBAS.cbas driver bootstrap failed; will retry in 1s (2 retries remaining)
2021-03-15T15:14:34.604+00:00 CRIT CBAS.cbas my node is in topology but not registered in cbauth db
2021-03-15T15:14:34.604+00:00 WARN CBAS.cbas driver bootstrap attempt failed: my node is in topology but not registered in cbauth db
2021-03-15T15:14:34.604+00:00 INFO CBAS.cbas driver bootstrap failed; will retry in 1s (1 retries remaining)
2021-03-15T15:14:35.604+00:00 CRIT CBAS.cbas my node is in topology but not registered in cbauth db
2021-03-15T15:14:35.605+00:00 WARN CBAS.cbas driver bootstrap attempt failed: my node is in topology but not registered in cbauth db
2021-03-15T15:14:35.605+00:00 FATA CBAS.cbas unable to complete driver bootstrap after 60 attempt(s) (last failure: my node is in topology but not registered in cbauth db)
This is running a production service, so I’d prefer not to tear down and recreate the whole cluster if I can avoid it. Any ideas on how to recover from this situation?