Auto rebalance after node failure

jfirmino · August 8, 2012, 11:06pm

Hey guys,
I’m having a little problem with my cluster, and I can’t figure out how to solve this. Been reading but nothing so far.
I have a cluster with 5 nodes and 6 buckets, each bucket has 2 replicas.
The problem is that when one of the nodes fails over for whatever reason, I have an outage on the website. It only comes back online after I rebalance the cluster (and the node gets removed).
Do you guys know of any option or a way to auto rebalance after a failover?
Cheers,
Jason

mikew · August 9, 2012, 11:05pm

First off, auto-failover is disabled by default, so you must enable it on your cluster for it to work. Auto-failover also does not take care of rebalancing the cluster for you after a node has failed over, but this should not have anything to do with data being unavailable. Once you fail a node over, the data should be available instantly on the other nodes. For more information on how auto-failover works, please see this wiki page: http://www.couchbase.com/wiki/display/couchbase/Autofailover+behavior. If you still have any other questions, let me know.
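
For anyone who wants to turn auto-failover on from a script rather than the web console, here is a minimal sketch. It assumes the admin REST endpoint /settings/autoFailover on port 8091 with the form fields enabled and timeout; the host name, credentials, and the 30-second timeout are placeholders, so check them against the documentation for your server version. The function only builds the request, it does not send it.

```python
import base64
import urllib.parse
import urllib.request


def build_autofailover_request(host, user, password, timeout_s=30):
    """Build (but do not send) a POST request that enables auto-failover.

    Assumption: the cluster exposes /settings/autoFailover on port 8091
    and accepts the form fields 'enabled' and 'timeout' (in seconds).
    """
    body = urllib.parse.urlencode(
        {"enabled": "true", "timeout": timeout_s}
    ).encode()
    req = urllib.request.Request(
        f"http://{host}:8091/settings/autoFailover",
        data=body,
        method="POST",
    )
    # HTTP basic auth with the cluster administrator credentials.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req


# Hypothetical host and credentials; to actually send the request:
# urllib.request.urlopen(
#     build_autofailover_request("cb1.example.com", "Administrator", "secret")
# )
```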

jfirmino · August 11, 2012, 11:02pm

Hi Mikew,
Thanks for your reply.
Auto-failover is enabled; that’s not the point. And I know it doesn’t take care of the rebalance, but that is exactly what I want…
I can script a rebalance after the auto-failover, but I wanted to know if there was any feature that would do that.
Because, believe it or not, that’s what happens if I don’t rebalance the cluster: I lose the website, because I am caching content, not only sessions…
Cheers,
Jason
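
The scripted rebalance mentioned in the post above could be sketched roughly as follows. This is not a built-in feature and not an official procedure. It assumes that /pools/default returns a "nodes" list whose entries carry "otpNode" and "clusterMembership" fields, that a failed-over node is reported as "inactiveFailed", and that /controller/rebalance accepts knownNodes and ejectedNodes as comma-separated form fields. Host and credentials are placeholders.

```python
import base64
import json
import urllib.parse
import urllib.request


def rebalance_params(nodes):
    """Build the rebalance form fields from a /pools/default node list.

    Returns None when no node is marked as failed over.
    """
    # knownNodes lists every node the cluster knows about,
    # including the ones that are about to be ejected.
    known = [n["otpNode"] for n in nodes]
    ejected = [n["otpNode"] for n in nodes
               if n.get("clusterMembership") == "inactiveFailed"]
    if not ejected:
        return None
    return {"knownNodes": ",".join(known), "ejectedNodes": ",".join(ejected)}


def rebalance_if_needed(host, user, password):
    """Fetch the node list and start a rebalance if a node was failed over."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    base = f"http://{host}:8091"

    req = urllib.request.Request(f"{base}/pools/default", headers=headers)
    with urllib.request.urlopen(req) as resp:
        nodes = json.load(resp)["nodes"]

    params = rebalance_params(nodes)
    if params is None:
        return False  # nothing failed over, nothing to do
    body = urllib.parse.urlencode(params).encode()
    post = urllib.request.Request(f"{base}/controller/rebalance",
                                  data=body, headers=headers, method="POST")
    urllib.request.urlopen(post).close()
    return True
```

Something like rebalance_if_needed("cb1.example.com", "Administrator", "secret") could then be run from cron or a monitoring hook. Rebalancing automatically right after a failure puts extra load on the surviving nodes, so it is worth testing outside production first.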

dmorfin · August 21, 2012, 11:00pm

I’ve asked the team to go back and double check the procedure they were testing to be sure that it really was failing differently on the 1.8 cluster.
Is multiple auto-failover targeted for the 2.0 release, or is it post 2.0?

dmorfin · August 21, 2012, 11:00pm

We’re seeing the same behavior on our 1.8 cluster (4 nodes, 3 replicas). We have auto-failover enabled, but when a node fails, the data becomes completely unavailable.
In our 1.7 cluster (8 nodes, 1 replica) we are not seeing that behavior. I don’t know if this is something that broke in 1.8, or if this is something that breaks if there is more than one replica.
As a side note, it would be lovely if anyone could tell me where in the code I would need to make a change to “break” auto-failover’s one-node-only “feature”, so that having multiple replicas was more immediately useful.

mikew · August 21, 2012, 11:00pm

It will be a post 2.0 release.

mikew · August 21, 2012, 11:00pm

In certain cases it may take up to 2 minutes for the cluster to decide a node needs to be failed over. Is this what you are seeing, or are you seeing that the data never becomes available? I only ask because we have not seen this issue in our QE regression tests.
Also, there is no code to break the one-node-only feature, and I wouldn’t recommend doing so even if you could. Auto-failover is very tricky, and the algorithms need to be very conservative to make sure bad things don’t happen. An incorrect implementation can lead to data loss or to a cluster failing over all of its nodes. In the future we will allow auto-failover to work for as many nodes as you have replicas, but for now we only support auto-failover of one node.
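
To see from a script whether the single auto-failover described above has already been used, the auto-failover settings can be inspected. The sketch below assumes that GET /settings/autoFailover returns JSON with "enabled", "timeout" and "count" fields, where count is the number of automatic failovers performed so far; the max_count parameter is a hypothetical knob that defaults to the one-node limit described in this thread.

```python
import json


def auto_failover_available(settings_json, max_count=1):
    """Return True if the cluster can still auto-failover a node.

    settings_json: body of GET /settings/autoFailover (assumed shape:
    {"enabled": ..., "timeout": ..., "count": ...}).
    max_count: hypothetical limit, 1 for the one-node-only behaviour.
    """
    settings = json.loads(settings_json)
    return bool(settings.get("enabled")) and settings.get("count", 0) < max_count


# Sample response bodies for illustration only, not real cluster output.
fresh = '{"enabled": true, "timeout": 30, "count": 0}'
used = '{"enabled": true, "timeout": 30, "count": 1}'
print(auto_failover_available(fresh))
print(auto_failover_available(used))
```

As far as I know, once the failed node has been dealt with, the counter is reset with a POST to /settings/autoFailover/resetCount, but verify that against the documentation for your version.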

dipti · August 26, 2012, 11:00pm

dmorfin
One of the reasons we have not enabled it so far is to prevent the thundering herd failure.
http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-admin-tasks…
From 1.8.1 onwards, you can use swap rebalance to swap out the bad node with a good one. Typically customers keep 1 backup server for such cases. Read more about swap rebalance here: http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-admin-tasks…
Hope this helps.

Norro21 · June 17, 2013, 11:00pm

When one server fails, we are seeing no data available with 1.8.1. Is this a bug or a problem with our settings? I thought that if one node goes down we are supposed to just lose caching. Is it perhaps a client-side problem?

tgrall · June 23, 2013, 11:00pm

Hello Norro21,
Are you using Memcached or Couchbase buckets?
Can you describe your topology: number of nodes, replicas, …?
Regards

Norro21 · June 24, 2013, 11:00pm

2 nodes, Couchbase bucket with 1 replica, .NET client and .NET session state

shalomah · May 17, 2017, 7:58am

Hi,

I’m facing the same issue: the data is available only after a rebalance.
Can you please suggest why?

I use a three-node cluster with one bucket and two replicas.

Thanks!
