getIndexStatus failed

mathias · May 10, 2016, 9:04am

Hi all

####Error description:
I’ve got a strange “error” in the error logs on my Couchbase servers. Roughly every 2 minutes (sometimes 3 and sometimes 1 minute) it says that the node is not able to retrieve the IndexStatus. Here are the original log entries:

[ns_server:error,2016-05-10T10:34:03.682+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:36:07.851+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:37:57.827+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
[ns_server:error,2016-05-10T10:38:12.829+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}
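
To put numbers on the "every 2 minutes" impression, the gaps between consecutive entries can be computed from the log itself. The following is only a minimal sketch, assuming Python 3.7+ and the log line layout shown above; the names `parse_timestamp`, `gaps_in_seconds` and `SAMPLE_LINES` are made up for this illustration and are not part of Couchbase.

```python
import re
from datetime import datetime

# Matches the timestamp field of an ns_server error line such as
# [ns_server:error,2016-05-10T10:34:03.682+02:00,ns_1@...]
TIMESTAMP_RE = re.compile(r"^\[ns_server:error,([^,]+),")

# The four entries quoted in the post (SERVERIP is the poster's placeholder).
SAMPLE_LINES = [
    "[ns_server:error,2016-05-10T10:34:03.682+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}",
    "[ns_server:error,2016-05-10T10:36:07.851+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}",
    "[ns_server:error,2016-05-10T10:37:57.827+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}",
    "[ns_server:error,2016-05-10T10:38:12.829+02:00,ns_1@SERVERIP:index_status_keeper_worker<0.363.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}",
]


def parse_timestamp(line):
    """Return the timestamp of a getIndexStatus failure line, or None."""
    if "getIndexStatus failed" not in line:
        return None
    match = TIMESTAMP_RE.match(line)
    if match is None:
        return None
    # Python versions before 3.11 do not accept a trailing "Z" here.
    return datetime.fromisoformat(match.group(1).replace("Z", "+00:00"))


def gaps_in_seconds(lines):
    """Seconds between consecutive getIndexStatus failures."""
    stamps = [ts for ts in map(parse_timestamp, lines) if ts is not None]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


for gap in gaps_in_seconds(SAMPLE_LINES):
    print(f"{gap:.3f}")
```

For the four entries above the gaps come out at roughly 124 s, 110 s and 15 s, so the interval is not strictly periodic. Passing the lines of a full error log file to `gaps_in_seconds` instead of `SAMPLE_LINES` would give a larger sample.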

When I curl this URL myself, it fails from time to time (with no real regularity) to retrieve the cluster-wide metadata from the index service.
Here is the error:

{"code":"error","error":"Fail to retrieve cluster-wide metadata from index service","failedNodes":["SERVERIP:9102"]}
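
Instead of curling by hand, the same request can be repeated on a schedule to see how often it fails and how long each call takes. This is a rough sketch only, using the Python standard library. It assumes that the endpoint is reachable without credentials, as the curl call was here, and that any JSON body without `"code":"error"` counts as a success. The second point is an assumption, since only the error body is shown in this thread. The helper names `classify`, `probe` and `poll` are invented for the example.

```python
import http.client
import json
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:9102/getIndexStatus"  # endpoint taken from the log lines


def classify(body):
    """Classify a response body as 'ok' or 'error'.

    Assumption: a body that is not JSON, or that carries "code": "error",
    is a failure; everything else is treated as success.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return "error"
    if isinstance(doc, dict) and doc.get("code") == "error":
        return "error"
    return "ok"


def probe(url=URL, timeout=5.0):
    """Request the URL once and return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            outcome = classify(resp.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Covers refused connections, timeouts and HTTP error statuses.
        outcome = f"failed: {exc}"
    return outcome, time.monotonic() - start


def poll(count=10, interval=15.0):
    """Probe the endpoint `count` times, `interval` seconds apart."""
    for _ in range(count):
        outcome, elapsed = probe()
        print(f"{time.strftime('%H:%M:%S')} {outcome} ({elapsed:.2f}s)")
        time.sleep(interval)
```

Running `poll()` on a node and comparing the printed times with the ns_server error timestamps would show whether the manual failures coincide with the logged timeouts.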

Additionally, I found out that this happens every time the connection of the index service gets closed.

####My Question:
Is this behavior normal or is there something wrong with my cluster?

####The Cluster:
4 Physical Servers (Dell PowerEdge R730)
Index, Data and Query service on each node
4.0.0-4051 Community Edition (build-4051)
3 Buckets (1 High, 1 Low and 1 Memcache)

####The Server:
Dell PowerEdge R730
115GB RAM (Couchbase Quota)
1 dedicated Indexdisk (SSD)
RAID 1 Data disk with 7200rpm

####Network:
1 dedicated network interface for communication between nodes
1 dedicated network interface for external access (SDK, Webgui, etc)

####OS:
Linux SERVERNAME 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux

####High Bucket:
~2’000 ops/sec
~700’000’000 Items

####Low Bucket:
~1’000 ops/sec
~60’000’000 Items

####Memcached Bucket:
~30’0000 ops/sec
~2’500’000 Items

####Load:
We have the usual web traffic, which causes the load on our Couchbase cluster (nearly nothing at night, with high peaks in the morning and evening).

####Misc:
The data disks have high I/O during the day
Avg. age of items is between 0 sec and 6000 sec (depends on node and hour)
Disk queue holds around 600’000 items during the day

prasad · May 12, 2016, 10:21pm

Hi Mathias, this is not normal. Do you see any other errors in the logs?

mathias · May 13, 2016, 10:28am

Hi Prasad

Thanks for the reply

Here are some logs.
logs.zip (45.0 KB)

Do you need more or is this enough?

peda · July 19, 2016, 3:49pm

Hi @mathias - did you find a solution for your problem? I’ve recently come across the same error message in one of my environments.

mathias · July 19, 2016, 4:10pm

Hi @peda
No, sorry. @prasad hasn’t answered me yet.

mathias · August 9, 2016, 1:53pm

Hi @peda

Have you found a solution yet?

peda · August 9, 2016, 2:21pm

Hi @mathias
No, unfortunately not. These errors are still showing up quite frequently in my log files, even though I’ve seen a little improvement since the upgrade to Couchbase 4.1 CE.

mathias · August 9, 2016, 2:33pm

@peda
Do you have this error in your web console too?
“Haven’t heard from a higher priority node or a master, so I’m taking over.”

peda · August 9, 2016, 2:47pm

@mathias Yes, I do have these errors too. But this issue has improved since I added static entries to /etc/hosts. It looks like Couchbase’s DNS resolution timeout is very rigid: if the DNS server is in a different network (as in any cloud environment), domain name resolution times out from time to time. Even with average DNS response times below 50 ms, these errors keep coming up.
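
An average response time can hide occasional slow lookups, so it may help to look at the slowest resolution over many attempts. The sketch below is one way to do that with Python's standard library. It goes through the operating system's resolver (which honours /etc/hosts on a typical Linux setup), and that is not necessarily the same path Couchbase itself uses, so treat the numbers as an indication only. The function names and the hostname in the usage comment are hypothetical.

```python
import socket
import time


def resolve_time(hostname):
    """Seconds taken to resolve hostname, or None if resolution fails."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start


def slowest(hostname, attempts=20):
    """Resolve hostname repeatedly; report failures and the slowest lookup."""
    times = [resolve_time(hostname) for _ in range(attempts)]
    ok = [t for t in times if t is not None]
    return {"failures": len(times) - len(ok), "max_seconds": max(ok, default=None)}


# Example (hypothetical node name):
# print(slowest("node1.internal", attempts=100))
```

A local resolver cache can make repeated lookups look faster than the first one, so the first attempt after a pause is often the most telling.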

mathias · August 10, 2016, 7:59am

@peda
I’ve had static /etc/hosts entries since the beginning of our cluster, because the internal cluster traffic goes over a different interface. This error is really strange…

peda · August 10, 2016, 9:04am

@mathias The messages “Haven’t heard from a higher priority node or a master” and “IP address seems to have changed.” have been occurring less frequently since I started using static /etc/hosts entries. My setup also uses a different network interface (with internal IP addresses) for cluster-internal traffic.

egrep · August 19, 2016, 4:12pm

@mathias, @peda,
I’ve got the same on 4.1.X (3 nodes):

[ns_server:error,2016-08-19T14:05:43.819Z,ns_1@host.internal:index_status_keeper_worker<0.1333.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/getIndexStatus failed: {error,timeout}

The resulting effect is that the view returns nothing via the Java SDK. Rebuilding the view (actually, recreating the design document that contains the view) solved the problem.

The entries for my internal hosts’ IPs are all static (I mean DNS resolution via /etc/hosts), and I don’t see any messages about lost nodes.
@prasad, any comments / ideas?

mathias · October 4, 2016, 8:48am

@egrep, @peda, @prasad

Hi guys

Does any of you have any information or progress on this topic?
My logs are still flooded with those entries. They now occur every 4 minutes…

mathias · October 4, 2016, 9:47am

Little status update:

I’m no longer able to add another server to the cluster, because the rebalance fails with a timeout and the server gets auto-failed over…

prasad · October 5, 2016, 9:53pm

Hi @mathias, sorry, I missed it. I have pinged the indexer team. I will keep you posted, or someone will respond…

mathias · October 5, 2016, 10:01pm

Hi @prasad

Thanks for the response. Thanks for your help

deepkaran.salooja · October 5, 2016, 10:49pm

@mathias, regarding this specific error message, it means the cluster manager (ns_server) is not able to reach the Indexer on port 9102. This could be due to the Indexer service being down or to network issues in the cluster (which looks more likely).

Can you share your logs after changing the log level to Debug from UI Settings? The log files shared earlier in this post are incomplete and not very useful.

mathias · October 5, 2016, 11:03pm

@deepkaran.salooja thanks for the reply

I tested all the ports listed in the docs with telnet and netcat, and every server is able to connect to them on the other servers.
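
The telnet/netcat check can be scripted so it is easy to repeat across nodes. This is a minimal sketch with Python's standard library. The function names are invented for the example, and the host names in the usage comment are hypothetical. Port 9102 is the one from the error message and 8091 is the web console port; the full list would have to come from the docs.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or the name did not resolve.
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}


# Example (hypothetical host names):
# for (host, port), ok in check_ports(["node1.internal", "node2.internal"], [8091, 9102]).items():
#     print(host, port, "open" if ok else "NOT reachable")
```

Note that a successful TCP connect only shows that the port accepts connections. It does not show that the service behind it answers a request in time, so an open port 9102 and a getIndexStatus timeout can both be true at once.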

I started the log collection. Do you have access to the s3.amazonaws.com/cb-customers share that is mentioned in “Working with Couchbase Support”?

deepkaran.salooja · October 6, 2016, 12:38am

@mathias, can you please upload it to s3-us-west-1.amazonaws.com/forumlogs

mathias · October 6, 2016, 9:55am

@deepkaran.salooja I sent you a PM with the info on where I uploaded it.
