Timeouts on query after hard failover

chris_miller · April 15, 2016, 8:59pm

Hello,

We’ve been testing HA configurations and recovery using 4.0.0-4051 Community Edition. I am using a three-node cluster with one replica copy on each bucket, view index replicas enabled, and auto-failover set to 2 minutes. I took one node down hard. Immediately following the auto-failover timeout, the remainder of the cluster started what looked like a full re-index of several views. During that indexing, requests against the server (using the couchbase-1.3.12 ruby gem) would throw timeout errors.

Couchbase::Error::Timeout error=failed to execute HTTP request, Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout

Since this is in a test environment, each remaining server has plenty of capacity (as shown in the attached screenshot). Given the replication configuration including view index replicas, why would we be unable to perform queries against the cluster in this situation? Why are we suffering any kind of timeouts once the failover of the downed node has begun?

It took a full 18 minutes for functionality to be restored, which was essentially the point at which reindexing finished.

househippo · April 17, 2016, 6:00pm

Query index definitions are not currently replicated automatically, so you have to do it manually by creating the same index on different machines.

It's easy, though. Example:

CREATE INDEX email1 ON bucket_name(email) USING GSI WITH {"nodes": [ "7.7.7.7" ]};

CREATE INDEX email2 ON bucket_name(email) USING GSI WITH {"nodes": [ "8.8.8.8" ]};

chris_miller · April 18, 2016, 12:57pm

Specifically what I’m referring to are the classic map/reduce views, which we make extensive use of.

househippo · April 19, 2016, 6:26am

@chris_miller,

When you first created the bucket, you should have checked

View index replicas

Once the bucket is created, you cannot go back and check it.

I always check it for app teams, because if and when they need views in prod, it's ready for them.

If you can, just XDCR to another bucket in the same cluster with a different name, this time with it checked.

Note that this does mean more disk I/O and CPU usage, since you are doubling the number of views, so make sure you have enough CPU and disk I/O capacity.

chris_miller · April 19, 2016, 2:48pm

@househippo,

Thank you for the advice. I was pretty sure view replicas were already set, but you made me go back and confirm. In fact we do have view replicas set for all buckets.

househippo · April 19, 2016, 4:28pm

Sorry, I see now that you already had replica views enabled.

Did you install Couchbase as root?

Could you log in as the couchbase user and check the ulimits?

ulimit -a

and take a screenshot and share it.

Also, can you check whether the currently running Couchbase processes have the correct ulimits too? Find the PID of the memcached.bin process or any of the beam.smp processes, then run:

cat /proc/{PID}/limits

and take a screenshot and share it.
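
If looking up each PID by hand is tedious, something like the rough sketch below might save a step. It assumes a Linux host with pgrep available and that the processes show up under the names memcached and beam.smp; the function names print_limits and report are made up for illustration. Reading another user's /proc entry may require running it as root or as the couchbase user.

```shell
#!/bin/sh
# Sketch: print selected resource limits for running Couchbase processes.
# Assumes Linux (/proc) and pgrep; the process-name patterns are assumptions.

# Print the open-files and max-processes rows for one PID.
print_limits() {
    grep -E '^Max (open files|processes)' "/proc/$1/limits"
}

# Find every PID whose process name matches the pattern and report its limits.
report() {
    for pid in $(pgrep "$1"); do
        echo "== $1 (PID $pid) =="
        print_limits "$pid"
    done
}

report memcached
report beam.smp
```

If nothing is printed, no process matched those names on that node, so it is worth checking the actual process names with ps first.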
