Keyspace not found: CBAuth database is stale

Hello Everyone,
We are using Couchbase 4.6.1 as a local cache for our service - in this case, one Couchbase instance per service instance. The client uses the Couchbase C SDK. OS: Windows Server 2012 x64.
The problem is that at a customer site our bucket randomly becomes unavailable. To fix it we have to delete the existing bucket and create a new one.
The bucket name is Nuance.
When searching for data we get the following error:

status: "fatal", [{"code":12003,"msg":"Keyspace not found keyspace Nuance - cause: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it."}] Error code 59 (HTTP Operation failed. Inspect status code for details)

The last time the issue happened was at 2018/04/24 15:31:12, and I see what seem to be related errors in error.log.1 around that time.
All the query.log.x files report “CBAuth database is stale” for a number of days before I see the error in our logs on April 24.

I would appreciate advice on how this issue can be tracked down and fixed. I have all the Couchbase logs.
Thanks,
Vlad

from query.log.9:

2018-04-23T17:40:17.844-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
_time=2018-04-23T17:40:19.174-07:00 _level=INFO _msg= keyspace Nuance not found CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it. 

from error.log.1:

[ns_server:error,2018-04-24T15:26:20.303-07:00,ns_1@127.0.0.1:<0.42.5>:menelaus_web:loop:187]Server error during processing: ["web request failed",
                                 {path,"/pools/default/buckets/Nuance"},
                                 {method,'GET'},
                                 {type,exit},
                                 {what,
                                  {timeout,
                                   {gen_server,call,[ns_config,get,15000]}}},

[ns_server:error,2018-04-24T15:27:00.257-07:00,ns_1@127.0.0.1:<0.1368.5>:menelaus_web:loop:187]Server error during processing: ["web request failed",
                                 {path,"/pools/default"},
                                 {method,'GET'},
                                 {type,exit},
                                 {what,
                                  {{noproc,
                                    {gen_server,call,
                                     ['index_status_keeper-index',
                                      get_indexes_version]}},
                                   {gen_server,call,
                                    [<0.1288.5>,
                                     #Fun<menelaus_web_cache.2.70484883>,
                                     infinity]}}},

[ns_server:error,2018-04-24T15:26:46.844-07:00,ns_1@127.0.0.1:ns_doctor<0.320.0>:ns_doctor:update_status:308]The following buckets became not ready on node 'ns_1@127.0.0.1': ["Nuance"], those of them are active ["Nuance"]
[ns_server:error,2018-04-24T15:33:42.596-07:00,ns_1@127.0.0.1:capi_ddoc_replication_srv-Nuance<0.541.0>:ns_couchdb_api:wait_for_doc_manager:307]Waited 10000 ms for doc manager pid to no avail. Crash.
[ns_server:error,2018-04-24T15:33:42.616-07:00,ns_1@127.0.0.1:capi_doc_replicator-Nuance<0.540.0>:ns_couchdb_api:wait_for_doc_manager:307]Waited 10000 ms for doc manager pid to no avail. Crash.

I’m a bit unclear. You’re saying it randomly becomes unavailable and you see those messages? Then later you fix it by deleting it and recreating it?

Or are you saying it randomly becomes unavailable and when deleting and recreating the bucket, you see those messages?

Note that a bucket delete/recreate propagates asynchronously through the cluster to the various services, so some messages about not being able to connect, while not great, may appear, and things should recover a little while after bucket creation. Effectively, bucket deletion/creation isn’t expected to be instantaneous.
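
For example, something along the lines of this untested sketch could be run after recreating the bucket, to wait until it reports ready before the service starts querying again. It assumes Couchbase on 127.0.0.1:8091, the bucket name Nuance from this thread, and placeholder admin credentials (Administrator/password) that you would replace with your own:

import base64
import json
import time
import urllib.request

ADMIN_USER = "Administrator"   # placeholder credentials - replace with your own
ADMIN_PASS = "password"
BUCKET_URL = "http://127.0.0.1:8091/pools/default/buckets/Nuance"

def bucket_ready():
    # Ask the cluster manager for the bucket's status; treat any failure
    # (404 right after recreation, or 8091 not answering yet) as "not ready".
    req = urllib.request.Request(BUCKET_URL)
    token = base64.b64encode(f"{ADMIN_USER}:{ADMIN_PASS}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            info = json.load(resp)
    except OSError:
        return False
    nodes = info.get("nodes", [])
    return bool(nodes) and all(n.get("status") == "healthy" for n in nodes)

deadline = time.time() + 120   # give the node up to two minutes to settle
while time.time() < deadline:
    if bucket_ready():
        print("bucket is ready")
        break
    time.sleep(2)
else:
    print("bucket still not ready - check error.log / debug.log")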

The bucket randomly becomes unavailable, our service’s search against Couchbase fails, and I see the error in our log: {“code”:12003,“msg”:"Keyspace not found keyspace Nuance…}.
We do not monitor the Couchbase logs. I tried to examine them today hoping to see the event which triggered the failure…
Previously our support techs would reboot the server, and that usually fixed the issue. At some point a reboot did not help, so they switched to stopping our service, then deleting and recreating the bucket, and that has fixed the issue about three times already.
Our setup for this customer is pretty simple: one Couchbase server per service instance, running on the same box, so basically there is no cluster per se…
If I browse through the 10 query.log files I see hundreds of errors similar to:

2018-04-25T12:25:55.725-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
2018-04-24T01:22:43.880-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
2018-04-23T17:40:17.844-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.

@ingenthr Hi Matt, any suggestions on what to look for in the Couchbase logs to track down the issue? Or maybe you or your team could review the logs I have? Thanks

Apologies for the long delay in replying, I’ve been traveling a bit. I’d start with the error.log and then maybe debug.log (see the docs for the location on your platform). One possible theory: a bug in cbauth in 4.6.1 that has since been fixed?

If you have an enterprise subscription, you may want to contact Couchbase support to have a look at the logs. Log analysis is usually iterative and may involve looking at a few components.

I also searched the issues and found that this behavior can be observed owing to defects in prepared statement handling that were fixed in 4.6.4; see MB-26075. Based on that finding, you should probably upgrade to 4.6.4 before doing any more digging.
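
If it helps, a quick way to confirm which version each node is actually running before and after the upgrade is to query the cluster manager’s REST API. This is only an illustrative sketch; it assumes 127.0.0.1:8091 and placeholder admin credentials:

import base64
import json
import urllib.request

# Sketch: list the hostname and Couchbase version reported by each node.
req = urllib.request.Request("http://127.0.0.1:8091/pools/default")
token = base64.b64encode(b"Administrator:password").decode()   # replace with real credentials
req.add_header("Authorization", "Basic " + token)
with urllib.request.urlopen(req, timeout=5) as resp:
    info = json.load(resp)
for node in info.get("nodes", []):
    print(node.get("hostname"), node.get("version"))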

This exception message says the client is trying to connect to the local host (127.0.0.1). 127.0.0.1 is the ‘loopback’ address; it allows the computer to communicate with itself via the network protocol, which matches your setup where Couchbase runs on the same box as the service.

This is a network-related error that occurred while establishing a connection to the server. It means there is no process listening at the hostname and port you specified: the machine exists, but it has no service listening on that port, so no connection can be established. Generally, either a firewall is blocking the connection, or the process that hosts the service is not listening on that specific port - either because it is not running at all or because it is listening on a different port.

Try running netstat -anb from the command line to see whether anything is listening on the port you entered. If you get nothing, try changing the port number and see if that works for you. On Windows you can run netstat from the command line (cmd.exe); on Linux you may need netstat -anp instead.
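
As a programmatic alternative to netstat, a small script can probe the ports involved here - 8091 for the cluster manager mentioned in the error message and 8093 for the query service - and report which ones refuse the connection. A rough sketch, assuming everything runs on 127.0.0.1 as in this thread:

import socket

# Sketch: try to open a TCP connection to each port and report the result.
for port in (8091, 8093):
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=3):
            print(f"port {port}: something is listening")
    except OSError as exc:
        print(f"port {port}: connection failed ({exc})")   # e.g. "actively refused"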

If the target machine only refuses the connection occasionally, it is likely because the server’s connection ‘backlog’ is full. Regardless of whether you can increase the server backlog, you do need retry logic in your client code; that often copes with this issue, since even with a long backlog the server might be receiving lots of other requests on that port at that time.
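
To illustrate the retry idea only - this is a sketch, not your actual client code - the loop below wraps a query call in retries with exponential backoff, so that transient connection-refused or keyspace-not-found errors right after a bucket recreation do not immediately surface as hard failures. run_query is a hypothetical stand-in for whatever call your service makes through the C SDK:

import time

def run_query(statement):
    # Hypothetical placeholder - in the real service this would be the C SDK query call.
    raise ConnectionRefusedError("No connection could be made because the target machine actively refused it")

def query_with_retry(statement, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return run_query(statement)
        except OSError:                                  # e.g. connection actively refused
            if attempt == attempts - 1:
                raise                                    # give up; let the caller log it
            time.sleep(base_delay * (2 ** attempt))      # back off: 0.5s, 1s, 2s, 4s, ...

# Example: query_with_retry("SELECT * FROM Nuance LIMIT 1")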