getIndexStatus failed

Hi all

####Error description:
I have a strange “error” in the error logs on my Couchbase servers. Roughly every 2 minutes (sometimes 3, sometimes 1 minute), the node reports that it is not able to retrieve the index status. Here are the original log entries:

[ns_server:error,2016-05-10T10:34:03.682+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:36:07.851+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:37:57.827+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:38:12.829+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
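
To see how regular these entries really are, the gaps between them can be computed from the timestamps. The following is only a rough Python sketch: the names `parse_timeout_times` and `gaps_in_seconds` are made up for illustration, and the regular expression assumes the exact log line format shown above.

```python
import re
from datetime import datetime

# Matches the ns_server error lines quoted above and captures the timestamp,
# which sits between the first and second comma of the bracketed prefix.
TIMEOUT_RE = re.compile(
    r"^\[ns_server:error,(?P<ts>[^,]+),.*getIndexStatus failed: \{error,timeout\}"
)

# The four log entries from this post, used as sample input.
SAMPLE = """\
[ns_server:error,2016-05-10T10:34:03.682+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:36:07.851+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:37:57.827+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:38:12.829+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
"""


def parse_timeout_times(lines):
    """Return the timestamps of all getIndexStatus timeout entries."""
    times = []
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if not match:
            continue
        ts = match.group("ts")
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if ts.endswith("Z"):
            ts = ts[:-1] + "+00:00"
        times.append(datetime.fromisoformat(ts))
    return times


def gaps_in_seconds(times):
    """Return the time difference between consecutive entries, in seconds."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in gaps_in_seconds(parse_timeout_times(SAMPLE.splitlines())):
        print("%.1f s since previous timeout" % gap)
```

For the four entries above, the gaps come out at roughly 124, 110 and 15 seconds, so the interval is not strictly periodic.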

When I curl this URL myself, it fails from time to time (with no real regularity) to retrieve the cluster-wide metadata from the index service.
Here is the error:

{"code":"error","error":"Fail to retrieve cluster-wide metadata from index service","failedNodes":["SERVERIP:9102"]}
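
The manual curl check can also be scripted so that it can be repeated and logged. This is a minimal Python sketch under a few assumptions: the function names `failed_nodes` and `poll_once` are invented for illustration, only the error reply shown above is known, so any reply without `"code":"error"` is simply treated as "no error reported", and the endpoint is assumed to answer without authentication, as it did for the curl call.

```python
import json
import urllib.error
import urllib.request

# URL taken from the log entries in this post.
URL = "http://127.0.0.1:9102/getIndexStatus"


def failed_nodes(body):
    """Return the failedNodes list if body is an error reply, else []."""
    try:
        reply = json.loads(body)
    except ValueError:
        # Not JSON at all; nothing to extract.
        return []
    if isinstance(reply, dict) and reply.get("code") == "error":
        return list(reply.get("failedNodes", []))
    return []


def poll_once(url=URL, timeout=5.0):
    """Fetch the URL once and return a short human-readable result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        # The server answered with an HTTP error status; still inspect the body.
        body = exc.read().decode("utf-8", "replace")
    except OSError as exc:
        # Covers connection refused, URLError and socket timeouts.
        return "unreachable: %s" % exc
    nodes = failed_nodes(body)
    if nodes:
        return "failed nodes: %s" % ", ".join(nodes)
    return "no error reported"
```

Calling `poll_once()` in a loop with a timestamp in front of each result would show whether the failures line up with the timeouts in the ns_server log.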

Additionally, I found out that this happens every time the connection to the index service gets closed.
####My Question:
Is this behavior normal or is there something wrong with my cluster?

####The Cluster:
4 physical servers (Dell PowerEdge R730)
Index, Data and Query service on each node
4.0.0-4051 Community Edition (build-4051)
3 buckets (1 High, 1 Low and 1 Memcached)

####The Server:
Dell PowerEdge R730
115GB RAM (Couchbase Quota)
1 dedicated Indexdisk (SSD)
RAID 1 Data disk with 7200rpm

####Network:
1 dedicated network interface for communication between nodes
1 dedicated network interface for external access (SDK, Webgui, etc)

####OS:
Linux SERVERNAME 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux

####High Bucket:
~2’000 ops/sec
~700’000’000 Items

####Low Bucket:
~1’000 ops/sec
~60’000’000 Items

####Memcached Bucket:
~30’0000 ops/sec
~2’500’000 Items

####Load:
We have typical web traffic, which drives the load on our Couchbase cluster (almost nothing at night, with high peaks in the morning and evening).

####Misc:
The data disks have high I/O during the day
Avg. age of items is between 6000 sec and 0 sec (depends on the node and the hour)
The disk queue holds around 600’000 items during the day

Hi Mathias, this is not normal. Do you see any other errors in the logs?

Hi Prasad

Thanks for the reply :slight_smile:

Here are some logs.
logs.zip (45.0 KB)

Do you need more or is this enough?

Hi @mathias - did you find a solution for your problem? I’ve recently come across the same error message in one of my environments.

Hi @peda
No, sorry, @prasad hasn’t answered me yet.

Hi @peda

Have you found a solution yet?

Hi @mathias
No, unfortunately not - these errors are still showing up quite frequently in my log files, even though I’ve seen a little improvement since the upgrade to Couchbase 4.1 CE.

@peda
Do you also see this error in your web console:
Haven’t heard from a higher priority node or a master, so I’m taking over.
?

@mathias Yes, I see these errors too. The issue improved after I added static entries to /etc/hosts; it looks like Couchbase’s DNS resolution timeout is very strict. If the DNS server is in a different network (as in any cloud environment), name resolution times out from time to time. Even with average DNS response times below 50 ms, these errors keep coming up.
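
To check whether name resolution is occasionally slow on a node, the lookup time can be sampled repeatedly. This is just a Python sketch: `resolve_ms` and `slowest_lookup_ms` are made-up names, and it measures the operating system resolver as seen from Python, which is an assumption and not necessarily the same code path that Couchbase itself uses.

```python
import socket
import time


def resolve_ms(hostname):
    """Return the lookup time for hostname in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return None
    return (time.monotonic() - start) * 1000.0


def slowest_lookup_ms(hostname, attempts=20, pause=0.5):
    """Resolve hostname several times and return the slowest lookup in ms.

    Returns None as soon as a single lookup fails.
    """
    worst = 0.0
    for _ in range(attempts):
        elapsed = resolve_ms(hostname)
        if elapsed is None:
            return None
        worst = max(worst, elapsed)
        time.sleep(pause)
    return worst
```

Running `slowest_lookup_ms` against each cluster node name, before and after adding the /etc/hosts entries, would show whether the worst case differs noticeably from the average.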

@peda
I’ve had static /etc/hosts entries since the beginning of our cluster, because the internal cluster traffic goes over a different interface. This error is really strange…

@mathias the messages “Haven’t heard from a higher priority node or a master” and “IP address seems to have changed.” have been occurring less frequently since I started using static /etc/hosts entries. My setup also uses a different network interface (with internal IP addresses) for cluster-internal traffic.

@mathias, @peda,
I’ve got the same on 4.1.X (3 nodes):

[ns_server:error,2016-08-19T14:05:43.819Z,ns_1@host.internal:index_status_keeper_worker<0.1333.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}

The resulting effect is that the view returns nothing via the Java SDK. Rebuilding the view (actually, recreating the design document that contains the view) solved the problem.

The IPs of my internal hosts are all static (I mean they are resolved via /etc/hosts), and I don’t see any messages about lost nodes.
@prasad, any comments / ideas?

@egrep, @peda, @prasad

Hi guys

Does any of you have any information or progress on this topic?
I still get the logs flooded with those entries. They happen now every 4 minutes…

Little status update:

I’m no longer able to add another server to the cluster, because the rebalance fails with a timeout and the server gets automatically failed over…

Hi @mathias, sorry, I missed it. I have pinged the indexer team. I will keep you posted, or someone will respond…

Hi @prasad

Thanks for the response. Thanks for your help

@mathias, regarding this specific error message: it means the cluster manager (ns_server) is not able to reach the Indexer on port 9102. This could be due to the Indexer service being down or to network issues in the cluster (which looks more likely).

Can you share your logs after changing the log level to Debug from UI Settings? The log files earlier shared in this post are incomplete and not very useful.

@deepkaran.salooja thanks for the reply

I tested all the ports listed in the docs with telnet and netcat, and every server is able to connect to them on all the other servers.
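
The same kind of telnet/netcat check can be scripted so it is easy to repeat across nodes. This is a small Python sketch; `can_connect` is a made-up helper name. Note that it only proves that a TCP connection can be opened, so it would not catch a service that accepts the connection but then answers too slowly, which is what an `{error,timeout}` may also indicate.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False
```

For example, `can_connect("127.0.0.1", 9102)` on each node checks the local indexer port from the log entries in this thread.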

I started the log collection. Do you have access to the s3.amazonaws.com/cb-customers share that is mentioned in “Working with Couchbase Support”?

@mathias, can you please upload it to s3-us-west-1.amazonaws.com/forumlogs?

@deepkaran.salooja I sent you a PM with the info on where I uploaded it.