Couchbase 4.1.0-5005 Crashing regularly after upgrade from 3.0.3-1716

Three nodes in the cluster are reporting the following error and restarting:

Service ‘goxdcr’ exited with status 1. Restarting. Messages: MetadataService 2017-01-08T00:36:40.223Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=3
MetadataService 2017-01-08T00:36:40.224Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=4
RemoteClusterService 2017-01-08T00:36:40.224Z [ERROR] Failed to get all entries, err=metakv failed for max number of retries = 5
Error starting remote cluster service. err=metakv failed for max number of retries = 5
[goport] 2017/01/08 00:36:40 /opt/couchbase/bin/goxdcr terminated: exit status 1

The full output of the crash, as it occurs on multiple nodes, is attached.

Couchbase_Err_log.zip (27.1 KB)
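
Since the error is a refused connection to 127.0.0.1:8091, here is a minimal sketch (Python, stdlib only, assuming the /_metakv/ path wants admin credentials; Administrator/password are placeholders) that polls the same metakv path from the log, so it can be left running across a goxdcr restart to see whether port 8091 ever stops accepting connections:

```python
# Minimal connectivity probe for the metakv endpoint from the goxdcr log.
# Assumptions: Python 3 on the node, and that /_metakv/ requires admin
# credentials; "Administrator"/"password" below are placeholders to replace.
import base64
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8091/_metakv/remoteCluster/"
USER, PASSWORD = "Administrator", "password"  # replace with real credentials

def probe(url):
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return "HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        return "HTTP %d" % exc.code
    except OSError as exc:  # URLError subclasses OSError; catches "connection refused"
        return "connection error: %s" % exc

if __name__ == "__main__":
    # Print one line per second; a "connection error" line would line up
    # with the "dial tcp 127.0.0.1:8091: connection refused" in the goxdcr log.
    while True:
        print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), probe(URL))
        time.sleep(1)
```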

Can you double check that the system is sized appropriately?

I checked JIRA, and in many of the cases where this error appears, the root cause was an undersized system leading to cascading failures. Here’s one example: https://issues.couchbase.com/browse/MB-16187

All of the reports I found with similar errors seemed to involve AWS/cloud deployments. The cluster where this is occurring runs on dedicated physical hardware:

16 nodes
256 GB RAM per server
2x 8-core CPUs per server

Total data allocation: 2.1 TB
Data RAM per server: 138.24 GB
Index RAM quota: 256 MB
Disk per server: 2.15 TB

4 buckets (usage / RAM quota):
data: 78.5 GB / 100 GB
default: 102 MB / 2 GB
profile: 1.58 TB / 2 TB
storage: 62.1 MB / 77.6 MB
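
For reference, a quick sketch that turns those figures into percent-of-quota (values copied as posted above; this is just the arithmetic, not a sizing verdict):

```python
# Percent-of-quota for each bucket, using the numbers listed above.
buckets = {
    "data":    (78.5,  100.0, "GB"),
    "default": (0.102, 2.0,   "GB"),   # 102 MB expressed in GB
    "profile": (1.58,  2.0,   "TB"),
    "storage": (62.1,  77.6,  "MB"),
}

for name, (used, quota, unit) in buckets.items():
    pct = 100.0 * used / quota
    print(f"{name:8s} {used:g} {unit} / {quota:g} {unit}  ({pct:.1f}% of quota)")
```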

This issue is still happening. Is there any resolution for it yet? Upgrading to 4.1.2 looks to be the suggested fix in this JIRA issue: https://issues.couchbase.com/browse/MB-16766

Any reason you’re not running the most recent 4.x (4.5.1 CE, 4.6.3 EE), or even version 5.0?

It’s very risky to upgrade this cluster due to the segfaults. Rebalance operations take roughly 8 hours to complete in a 16-node cluster, and I’ve already upgraded from 3.x to 4.1 to resolve a separate critical bug.

Was the cause of this ever identified? I’ve started to see this issue weekly, at the same time every Sunday, for the past three Sundays. My cluster is Couchbase CE 4.1.1, two nodes with under 5% CPU usage and about 70% RAM usage per node on average, so it is definitely not undersized. If the root cause was definitively found and fixed, could you please let me know which version it was fixed in?


@alexegli we managed to upgrade the cluster to 4.5.1, which resolved our issues.