Auto rebalance after node failure
Hey guys,
I'm having a little problem with my cluster, and I can't figure out how to solve this. Been reading but nothing so far.
I have a cluster with 5 nodes and 6 buckets, each bucket has 2 replicas.
The problem is, when one of the nodes fails over for whatever reason, I have an outage on the website. It only comes back online after I rebalance the cluster (and the node gets removed).
Do you guys know of any option or a way to auto rebalance after a failover?
Cheers,
Jason
Hi Mikew,
Thanks for your reply.
Auto-failover is enabled; that's not the point. And I know it doesn't take care of the rebalance. But that is exactly what I want...
I can script a rebalance after the auto-failover, but I wanted to know if there was any feature that would do that.
Because, believe it or not, that's what happens if I don't rebalance the cluster: I lose the website, because I am caching content, not only sessions...
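For anyone who wants to script the rebalance in the meantime, here is a rough sketch of what that could look like. It assumes the cluster's REST interface exposes the node list at /pools/default (with "otpNode" and "clusterMembership" fields) and starts a rebalance via POST /controller/rebalance with "knownNodes" and "ejectedNodes" parameters. Those endpoint and field names are from memory, not from this thread, so check them against the manual for your version. The function names and the base URL are made up for illustration.

```python
import base64
import json
import urllib.parse
import urllib.request


def plan_rebalance(nodes):
    """Split node records into (known, ejected) lists of otpNode names.

    Assumption: a failed-over node reports clusterMembership
    "inactiveFailed". Every node is listed as known; only the
    failed-over ones are listed for ejection.
    """
    known = [n["otpNode"] for n in nodes]
    ejected = [n["otpNode"] for n in nodes
               if n.get("clusterMembership") == "inactiveFailed"]
    return known, ejected


def rebalance_body(known, ejected):
    """Form-encode the parameters assumed for POST /controller/rebalance."""
    return urllib.parse.urlencode({
        "knownNodes": ",".join(known),
        "ejectedNodes": ",".join(ejected),
    })


def rebalance_if_failed_over(base_url, user, password):
    """Fetch the node list and start a rebalance if a node was failed over.

    base_url is hypothetical, e.g. "http://localhost:8091".
    Returns True if a rebalance was requested, False otherwise.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": "Basic " + token}

    req = urllib.request.Request(base_url + "/pools/default", headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        nodes = json.load(resp)["nodes"]

    known, ejected = plan_rebalance(nodes)
    if not ejected:
        return False  # nothing has been failed over, nothing to do

    body = rebalance_body(known, ejected).encode()
    req = urllib.request.Request(base_url + "/controller/rebalance",
                                 data=body, headers=headers)
    urllib.request.urlopen(req, timeout=10).close()
    return True
```

You would run rebalance_if_failed_over from cron or a monitoring hook. Keep in mind that rebalancing automatically right after a failure puts extra load on the remaining nodes, so it may be worth adding a delay or a sanity check before calling it.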
Cheers,
Jason
We're seeing the same behavior on our 1.8 cluster (4 nodes, 3 replicas). We have auto-failover enabled, but when a node fails, the data becomes completely unavailable.
In our 1.7 cluster (8 nodes, 1 replica) we are not seeing that behavior. I don't know if this is something that broke in 1.8, or if this is something that breaks if there is more than one replica.
As a side note, it would be lovely if anyone knows where in the code I would need to change to "break" auto-failover's one-node-only "feature", so that having multiple replicas was more immediately useful.
In certain cases it may take up to 2 minutes for the cluster to decide a node needs to be failed over. Is this what you are seeing, or are you seeing that the data never becomes available? I only ask because we have not seen this issue in our QE regression tests.
Also, there is no code to break the one-node-only feature, and I wouldn't recommend doing so even if you could. Doing auto-failover is very tricky, and the algorithms need to be very conservative in order to make sure bad things don't happen. An implementation that is done incorrectly can lead to data loss or to a cluster failing over all of its nodes. In the future we will allow auto-failover to work for as many nodes as you have replicas, but for now we only support auto-failover with one node.
I've asked the team to go back and double check the procedure they were testing to be sure that it really was failing differently on the 1.8 cluster.
Is multiple auto-failover targeted for the 2.0 release, or is it post 2.0?
It will be a post 2.0 release.
dmorfin
One of the reasons we have not enabled it so far is to prevent the thundering herd failure.
http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-admin-tasks...
From 1.8.1 onwards, you can use swap rebalance to swap out the bad node with a good one. Typically customers keep 1 backup server for such cases. Read more about swap rebalance here: http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-admin-tasks...
Hope this helps.
First off, you must enable auto-failover on your cluster for it to work, since it is disabled by default. Auto-failover also does not take care of rebalancing the cluster for you after a node has failed over, but this should not have anything to do with data being unavailable. Once you fail a node over, the data should be available instantly on the other nodes. For more information on how auto-failover works, please see this wiki page: http://www.couchbase.com/wiki/display/couchbase/Autofailover+behavior. If you still have any other questions, let me know.
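If you'd rather turn auto-failover on from a script than from the web console, something along these lines might work. It is a sketch that assumes the setting lives at POST /settings/autoFailover and takes "enabled" and "timeout" (in seconds) form parameters. That endpoint and those parameter names are recalled from the REST docs rather than taken from this thread, so verify them for your version. The function names and base URL are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def auto_failover_form(enabled, timeout_seconds):
    """Form-encode the assumed auto-failover settings parameters."""
    return urllib.parse.urlencode({
        "enabled": "true" if enabled else "false",
        "timeout": int(timeout_seconds),
    })


def set_auto_failover(base_url, user, password, enabled, timeout_seconds):
    """POST the settings to the cluster (base_url is hypothetical)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        base_url + "/settings/autoFailover",
        data=auto_failover_form(enabled, timeout_seconds).encode(),
        headers={"Authorization": "Basic " + token},
    )
    urllib.request.urlopen(req, timeout=10).close()
```

This only changes the setting. As noted above, it does not rebalance the cluster after a node has been failed over.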