Auto failover takes about 30 seconds, which is a lifetime when nodes are on the same switch. What are the repercussions of a lower failover timeout? Has anyone ever looked into this?
I guess you are talking about the automatic failover option in the console: it's 30 minutes, not 30 seconds.
It is not a setting meant for rebalancing requests if a server is busy or down: according to both my tests and what is written in the manual, this kind of "failover" is automatic and transparent if you have a replica factor > 1 for the bucket.
The setting you are talking about is instead a timeout before "definitively removing" the server from the cluster, in order to redistribute data and strictly match the replica factor again.
If this kind of failover happens and the server comes back online, it has to download the data again before being able to process requests (and of course has to be manually added to the cluster again).
As the manual suggests, it's better to leave this disabled, especially (I add) if you have a replica factor of 1 or 2 and fewer than 5 to 10 servers in the cluster; otherwise you risk completely deleting your data in the event of, e.g., an unexpected load increase.
Btw I'm pretty new to Couchbase and would wait for an "official" reply from them ;).
Just to correct the answer above, the option in the console is in seconds (30 seconds by default, which is also the minimum value), and the documentation explains:
- why it is disabled by default
- why the minimum value is 30s
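For anyone who wants to script the setting rather than use the console, here is a minimal sketch in Python. It assumes the auto-failover settings are exposed at `/settings/autoFailover` on the admin port 8091 with `enabled` and `timeout` form fields, which is how I recall the Couchbase REST API; check the documentation for your server version. The host, user name, and password are placeholders, and the helper name `build_autofailover_request` is made up for this example. The request is only built here, not sent.

```python
import base64
import urllib.parse
import urllib.request

MIN_TIMEOUT_S = 30  # minimum value mentioned in this thread


def build_autofailover_request(host, user, password, timeout_s, enabled=True):
    """Build (but do not send) a POST that changes the auto-failover settings."""
    if timeout_s < MIN_TIMEOUT_S:
        raise ValueError(f"timeout must be at least {MIN_TIMEOUT_S} seconds")
    # Form-encoded body; field names are an assumption based on the REST API docs.
    body = urllib.parse.urlencode(
        {"enabled": "true" if enabled else "false", "timeout": timeout_s}
    ).encode("ascii")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"http://{host}:8091/settings/autoFailover",
        data=body,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


# Placeholder host and credentials, for illustration only.
req = build_autofailover_request("localhost", "Administrator", "password", 30)
print(req.get_method(), req.full_url, req.data.decode())
# To apply it against a real node: urllib.request.urlopen(req)
```

The client-side check on the 30-second minimum only mirrors what was said above; the server remains the authority on which values it accepts.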
As Frances said, the Couchbase documentation explicitly explains why it is not necessarily a good idea to use auto failover; take a look at:
Honestly, 1 (Thundering Herd) and 2 (Network Partitioning) are bad planning. If you are running at 80-90% network capacity, a surge in traffic could take you down. Network partitioning can also be handled if you are building a redundant, resilient system.
My problem here is that 30 seconds is a lifetime in the web world, and by the time 30 seconds rolls around you have lost those users, possibly forever. And it seems that, the way Couchbase works, you cannot get a copy of the key data and cannot overwrite the data until the node has been failed over.
30 seconds seems arbitrary, and may make sense when nodes are scattered all around AWS, but when nodes are lined up on a single switch... that is another story.
Sorry, I thought Couchbase worked almost like Cassandra, but it seems that's not the case.
I ran this test: even with the whole cluster correctly rebalanced and a replica factor of 1, when I gracefully stop one server I am not able to read 1/4 of the keys in my 4-server test installation.
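The 1/4 figure is what you would expect if each key has exactly one active copy and the active copies are spread evenly over the nodes. Here is a rough simulation in Python of that reasoning. It is a simplification, not the real client: the CRC32-modulo mapping only approximates Couchbase's key-to-vBucket hash, the round-robin vBucket placement is an assumption, and the node names and key pattern are invented.

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase's usual default vBucket count
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical node names


def vbucket_for(key):
    # Simplified mapping: the real hash is CRC32-based but not exactly this.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def active_node_for(key):
    # Assumes vBuckets are spread evenly (round-robin) after a rebalance.
    return NODES[vbucket_for(key) % len(NODES)]


def unreadable_fraction(keys, down_node):
    # Until a failover promotes the replicas, keys whose active vBucket
    # lives on the stopped node cannot be served.
    lost = sum(1 for k in keys if active_node_for(k) == down_node)
    return lost / len(keys)


keys = [f"user::{i}" for i in range(100_000)]
fraction = unreadable_fraction(keys, "node2")
print(f"{fraction:.1%} of keys unreadable while node2 is down")
```

With four nodes the result should land close to a quarter of the keys, which matches the test above: the replicas exist, but they are not promoted to active until the node is failed over.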
This suggests to me the idea of using two clusters synchronized with XDCR replication.
When both clusters have all their servers up, I distribute the requests across both of them using consistent hashing.
When one of the clusters has a faulty server, I can route either all the requests, or only the ones that would hit the faulty server, to the cluster that is fully up and running.
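To make the idea concrete, here is a small Python sketch of the routing part only. It is not a Couchbase client: the cluster names, the `Ring` class, and the `route` function are all made up for illustration, and the health information is assumed to come from some monitoring of your own. It implements the simpler variant, where every request moves away from a degraded cluster.

```python
import bisect
import hashlib


class Ring:
    """A tiny consistent-hash ring with virtual nodes."""

    def __init__(self, names, vnodes=64):
        # Each cluster gets `vnodes` points on the ring to even out the split.
        self._points = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in names
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    @staticmethod
    def _hash(text):
        return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")

    def owner(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._points)
        return self._points[idx][1]


def route(key, ring, healthy):
    """Return the ring owner if it is healthy, otherwise a healthy fallback."""
    owner = ring.owner(key)
    if owner in healthy:
        return owner
    fallback = sorted(healthy)
    if not fallback:
        raise RuntimeError("no healthy cluster available")
    return fallback[0]


ring = Ring(["cluster-a", "cluster-b"])  # hypothetical cluster names
print(route("user::42", ring, healthy={"cluster-a", "cluster-b"}))
print(route("user::42", ring, healthy={"cluster-b"}))
```

Keep in mind that XDCR is asynchronous, so a request rerouted to the other cluster may see slightly older data than the cluster it would normally have hit.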
Does it make sense?