Intermittent memcached error on CB 3.0 Enterprise

alexegli · May 4, 2015, 8:57pm

Every few days I see the same error in the logs in the Couchbase console:

Control connection to memcached on 'ns_1@10.0.0.10' disconnected: {badmatch,
                                                                   {error,
                                                                    timeout}}

Then in the same second it will say:

Bucket "auth" loaded on node 'ns_1@10.0.0.10' in 0 seconds.

I looked into the error.log and saw these details for the first couple occurrences:

[ns_doctor:error,2015-03-22T8:21:30.829,ns_1@10.0.0.30:ns_log<0.11472.0>:ns_doctor:get_node:189]Error attempting to get node 'ns_1@10.0.0.10': {exit,
                                                {noproc,
                                                 {gen_server,call,
                                                  [ns_doctor,
                                                   {get_node,
                                                    'ns_1@10.0.0.10'}]}}}
[stats:error,2015-04-22T17:13:30.067,ns_1@10.0.0.30:<0.29087.6>:stats_collector:handle_info:124]Exception in stats collector: {exit,
                                  {{badmatch,{error,timeout}},
                                   {gen_server,call,
                                       ['ns_memcached-auth',
                                        {stats,<<>>},
                                        180000]}},
                                  [{gen_server,call,3,
                                       [{file,"gen_server.erl"},{line,188}]},
                                   {ns_memcached,do_call,3,
                                       [{file,"src/ns_memcached.erl"},
                                        {line,1399}]},
                                   {stats_collector,grab_all_stats,1,
                                       [{file,"src/stats_collector.erl"},
                                        {line,84}]},
                                   {stats_collector,handle_info,2,
                                       [{file,"src/stats_collector.erl"},
                                        {line,116}]},
                                   {gen_server,handle_msg,5,
                                       [{file,"gen_server.erl"},{line,604}]},
                                   {proc_lib,init_p_do_apply,3,
                                       [{file,"proc_lib.erl"},{line,239}]}]}

And these for subsequent occurrences:

[ns_server:error,2015-05-04T19:57:03.107,ns_1@10.0.0.30:ns_doctor<0.11518.0>:ns_doctor:update_status:229]The following buckets became not ready on node 'ns_1@10.0.0.10': ["auth"], those of them are active ["auth"]
[ns_server:error,2015-05-04T19:57:03.112,ns_1@10.0.0.30:ns_doctor<0.11518.0>:ns_doctor:update_status:229]The following buckets became not ready on node 'ns_1@10.0.0.30': ["auth"], those of them are active ["auth"]
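To see how often entries like these recur, the entry headers in error.log can be pulled out with a short script. This is a minimal Python sketch, assuming the header layout visible in the excerpts above ([component:level,timestamp,node:...]). The name `summarize_errors` and the regex are illustrative only and are not part of any Couchbase tooling.

```python
import re

# Header layout assumed from the excerpts above:
# [component:level,timestamp,node:process:module:function:line]message
ENTRY_HEADER = re.compile(
    r"^\[(?P<component>[^:\]]+):(?P<level>\w+),"
    r"(?P<timestamp>[^,]+),"
    r"(?P<node>[^:]+):"
)


def summarize_errors(lines):
    """Return (timestamp, component, node) for every entry header at level 'error'.

    Continuation lines of a multi-line entry do not match the header
    pattern and are skipped.
    """
    entries = []
    for line in lines:
        match = ENTRY_HEADER.match(line)
        if match and match.group("level") == "error":
            entries.append(
                (match.group("timestamp"), match.group("component"), match.group("node"))
            )
    return entries


if __name__ == "__main__":
    # Two headers copied from the excerpts above, plus one continuation line.
    sample = [
        "[ns_doctor:error,2015-03-22T8:21:30.829,ns_1@10.0.0.30:ns_log<0.11472.0>:ns_doctor:get_node:189]Error attempting to get node 'ns_1@10.0.0.10': {exit,",
        "                                                {noproc,",
        "[stats:error,2015-04-22T17:13:30.067,ns_1@10.0.0.30:<0.29087.6>:stats_collector:handle_info:124]Exception in stats collector: {exit,",
    ]
    for timestamp, component, node in summarize_errors(sample):
        print(timestamp, component, node)
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.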

We’re running Couchbase Server Enterprise 3.0.2 on Ubuntu in a two-node cluster, with Sync Gateway on each node. We have noticed only one possible client-side problem from this: a client using a mobile Couchbase Lite client wasn’t able to replicate all of their documents when the replication ran around the time one of these errors occurred. We’re not sure whether it’s related. Is this a normal Couchbase error that doesn’t affect anything and can be ignored, or do we have something misconfigured on the server?

Thanks in advance for any help anyone can give on this topic.

ingenthr · May 13, 2015, 2:43pm

It looks like intermittent networking issues. I don’t see anything there regarding restarting processes, so I think it’s safe to say there’s no crash. Do you see any evidence of a crash in further details in the debug.log? That’s where I’d look next.

Since you’re using Enterprise, if you have a subscription you can also request that Couchbase look over your full collect_info output.

alexegli · May 13, 2015, 3:15pm

Thanks! We didn’t notice any crashes; it just seemed to coincide with users trying to connect and sync to Couchbase. We’re in the Azure cloud, so we should have a good network connection, but Microsoft has had issues with their cloud before. I’ll look into the debug log more and see if I can find anything.

mliu · March 27, 2017, 7:37pm

Did you ever get to the bottom of this? I’m seeing this every so often as well, and twice in one day just now; an automatic failover is initiated after it.

We’re on Google Cloud networks, and I’m wondering if there’s anything special we have to tune, like TCP keepalives, for memcached.

On 3.0.1

alexegli · March 27, 2017, 7:40pm

We’re on Azure, and we stopped looking into this because Couchbase support didn’t find anything bad in our logs. The default timeout on Azure VMs is 4 or 5 minutes, though, so I generally override it to 15 minutes to allow for heartbeats between our iOS clients and the server during continuous pull replication. It doesn’t seem to work, though, since Sync Gateway will sometimes just stop sending keepalives for no discernible reason.

mliu · March 27, 2017, 7:53pm

K, thx for info.

Supposedly Google has a socket timeout of 10 min. The default Debian tcp_keepalive_time is 7200 s (2 hrs), so it’s definitely above the 10 min. I’m going to set it to 5 min to see if it makes any difference.

https://groups.google.com/forum/#!searchin/gce-discussion/compute$20network$20timeout|sort:relevance/gce-discussion/AxaHhT_Q2LY/dSw-rk5KDQAJ
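The reasoning here is that a keepalive probe only helps if it is sent before the network drops the idle connection. A minimal Python sketch of that comparison follows, assuming a Linux host where the setting is readable from /proc/sys/net/ipv4/tcp_keepalive_time. The 600 s idle timeout is the unverified "10 min" figure quoted above, and the function names are made up for illustration.

```python
from pathlib import Path

# On Linux the keepalive idle time (seconds) is exposed at this path.
KEEPALIVE_TIME_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def keepalive_fires_too_late(keepalive_time_s, idle_timeout_s):
    """True if the first keepalive probe would only be sent at or after the
    point where the network has already dropped the idle connection."""
    return keepalive_time_s >= idle_timeout_s


def read_keepalive_time(path=KEEPALIVE_TIME_PATH):
    """Return tcp_keepalive_time in seconds, or None if it cannot be read
    (for example on a non-Linux host)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    idle_timeout_s = 600  # assumption: the 10 min timeout mentioned above
    current = read_keepalive_time()
    if current is None:
        print("tcp_keepalive_time is not readable on this host")
    elif keepalive_fires_too_late(current, idle_timeout_s):
        print(f"{current}s: first probe would come after the idle timeout")
    else:
        print(f"{current}s: first probe would come before the idle timeout")
```

With the Debian default of 7200 s the check reports a problem against a 600 s timeout; with the planned 300 s it does not.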

Can someone from Couchbase confirm that the inter-node communication sockets in either Couchbase or memcached use the keepalive option when opening sockets? Otherwise, this configuration is useless…

mliu · March 27, 2017, 8:17pm

Just confirmed with netstat that keepalive is used on the memcached sockets, so we’ll see if tuning keepalive is helpful.
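For context on why that check matters: the kernel-wide keepalive settings only apply to sockets that have SO_KEEPALIVE enabled. The Python sketch below shows what enabling it looks like at the socket level. It is only an illustration, not how memcached or ns_server open their sockets; the function name and the default values are assumptions, and the per-socket TCP_KEEP* overrides are applied only where the platform defines them (as Linux does).

```python
import socket


def open_keepalive_socket(idle_s=300, interval_s=60, probes=5):
    """Create a TCP socket with keepalive turned on.

    idle_s, interval_s and probes are illustrative defaults, not values
    taken from Couchbase or memcached.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Without this option the kernel never sends keepalive probes,
    # whatever tcp_keepalive_time is set to.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide settings, where available.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = open_keepalive_socket()
    enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    print("keepalive enabled:", enabled)
    s.close()
```

A socket opened without the SO_KEEPALIVE call would be unaffected by any change to tcp_keepalive_time, which is the case the question above was trying to rule out.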

drigby · March 28, 2017, 9:49am

It’s probably worth highlighting that 3.x Enterprise is End-of-Life as of Feb 2017. I’d recommend moving to the most recent release: 4.6.1 EE (or 4.5.0 CE).
