Couchbase 4.1.0-5005 Crashing regularly after upgrade from 3.0.3-1716

Three nodes in the cluster are reporting the following error and restarting:

Service ‘goxdcr’ exited with status 1. Restarting. Messages: MetadataService 2017-01-08T00:36:40.223Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=3
MetadataService 2017-01-08T00:36:40.224Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=4
RemoteClusterService 2017-01-08T00:36:40.224Z [ERROR] Failed to get all entries, err=metakv failed for max number of retries = 5
Error starting remote cluster service. err=metakv failed for max number of retries = 5
[goport] 2017/01/08 00:36:40 /opt/couchbase/bin/goxdcr terminated: exit status 1

The full output of the crash, as it occurs on multiple nodes, is attached.

Couchbase_Err_log.zip (27.1 KB)
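
Since the error is a refused connection to 127.0.0.1:8091, here is a minimal sketch (Python, stdlib only, assuming the /_metakv/ path wants admin credentials; Administrator/password are placeholders) that polls the same metakv path from the log, so it can be left running across a goxdcr restart to see whether port 8091 ever stops accepting connections:

```python
# Minimal connectivity probe for the metakv endpoint from the goxdcr log.
# Assumptions: Python 3 on the node, and that /_metakv/ requires admin
# credentials; "Administrator"/"password" below are placeholders to replace.
import base64
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8091/_metakv/remoteCluster/"
USER, PASSWORD = "Administrator", "password"  # replace with real credentials

def probe(url):
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return "HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        return "HTTP %d" % exc.code
    except OSError as exc:  # URLError subclasses OSError; catches "connection refused"
        return "connection error: %s" % exc

if __name__ == "__main__":
    # Print one line per second; a "connection error" line would line up
    # with the "dial tcp 127.0.0.1:8091: connection refused" in the goxdcr log.
    while True:
        print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), probe(URL))
        time.sleep(1)
```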

Can you double check that the system is sized appropriately?

I checked JIRA, and in many of the cases where this error appears, the root cause was an undersized system leading to cascading failures. Here’s one example: https://issues.couchbase.com/browse/MB-16187

All of the reports I found with similar errors seemed to involve AWS/cloud deployments. The cluster where this is occurring runs on dedicated physical hardware:

16 nodes
256 GB RAM per server
2x 8-core CPUs per server

Total data allocation: 2.1 TB
Data RAM per server: 138.24 GB
Index RAM quota: 256 MB
Disk per server: 2.15 TB

4 buckets (usage / RAM quota):
data: 78.5 GB / 100 GB
default: 102 MB / 2 GB
profile: 1.58 TB / 2 TB
storage: 62.1 MB / 77.6 MB
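
For reference, a quick sketch that turns those figures into percent-of-quota (values copied as posted above; this is just the arithmetic, not a sizing verdict):

```python
# Percent-of-quota for each bucket, using the numbers listed above.
buckets = {
    "data":    (78.5,  100.0, "GB"),
    "default": (0.102, 2.0,   "GB"),   # 102 MB expressed in GB
    "profile": (1.58,  2.0,   "TB"),
    "storage": (62.1,  77.6,  "MB"),
}

for name, (used, quota, unit) in buckets.items():
    pct = 100.0 * used / quota
    print(f"{name:8s} {used:g} {unit} / {quota:g} {unit}  ({pct:.1f}% of quota)")
```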

This issue is still happening. Is there any resolution for it yet? Upgrading to 4.1.2 looks to be the suggested fix in this JIRA issue: https://issues.couchbase.com/browse/MB-16766

Any reason you’re not running the most recent 4.x (4.5.1 CE, 4.6.3 EE), or even version 5.0?

It’s very risky to upgrade this cluster due to the segfaults. Rebalance operations take roughly 8 hours to complete in a 16-node cluster, and I’ve already upgraded from 3.x to 4.1 to resolve a separate critical bug.

Was the cause of this ever identified? I’ve started to see this issue weekly, at the same time every Sunday, for the past three Sundays. My cluster is Couchbase CE 4.1.1, two nodes with under 5% CPU usage and about 70% RAM usage per node on average, so it is definitely not undersized. If the root cause was definitively found and fixed, could you please let me know which version it was fixed in?


@alexegli we managed to upgrade the cluster to 4.5.1, which resolved our issues.