Node failure management
Hey there.
I would like some explanation about node failure scenarios.
Here is our scenario:
We are using the .NET client to connect to NS Cluster and retrieve the active nodes (REST).
We are running a cluster with 6 nodes.
During some server maintenance this morning we took one of the nodes out of the cluster and received some timeout errors.
The question is: what is the 'best practice' for managing single-node failures?
When one of the nodes goes down (a timeout occurs), should we instruct our NS client to retrieve the active nodes again, or is there some way to do it automatically?
Regards
Thank you Perry for your response.
Very prompt, as always.
It seems that our problem may be the outdated Enyim client version.
I've seen that the "detect offline nodes" feature was introduced in the latest release, so we will update our clients and see how they handle the failures.
I will report back here.
I hope that the issue will be fixed, because right now it seems that we have a single point of failure, and that is not acceptable.
Also, do you think that increasing the connection timeout in the Enyim configuration can help to prevent errors while the client is pooling a new node after another one has failed?
Thanks again.
While increasing the connection timeout will prevent errors on the client, it will also increase the amount of time that a client may be left hanging while waiting for data. In the case of memcached (as opposed to Membase), I think it is more desirable to have the client "miss" the data and go about regenerating it so that the user can continue rather than wait. It's a balance that depends on the application and its behavior.
Perry
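To illustrate the "miss and regenerate" idea above, here is a minimal sketch in Python. It is not the Enyim API: CacheTimeout, get_or_regenerate, and the cache_get / cache_set / regenerate callables are hypothetical names standing in for whatever your client and application provide. The only assumption is that the cache client raises an exception when a node does not answer in time.

```python
from typing import Callable, Optional


class CacheTimeout(Exception):
    """Hypothetical error raised when a cache node does not answer in time."""


def get_or_regenerate(cache_get: Callable[[str], Optional[str]],
                      cache_set: Callable[[str, str], None],
                      key: str,
                      regenerate: Callable[[str], str]) -> str:
    """Return the cached value, treating a timeout exactly like a cache miss."""
    try:
        value = cache_get(key)
    except CacheTimeout:
        value = None  # the node is slow or down: behave as if the key was missing

    if value is not None:
        return value

    # Miss (or timeout): rebuild the value from the source of truth.
    value = regenerate(key)
    try:
        cache_set(key, value)  # best effort; the user already has the value
    except CacheTimeout:
        pass
    return value
```

With a short timeout, a dead node costs one quick failure plus one regeneration per request instead of a long wait. Whether that trade is acceptable depends on how expensive regeneration is for your application.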
Thanks for the inquiry. What version of the Enyim client are you using? There have been some recent changes to improve the behavior around handling of failures.
At a high level, the behavior should be that when a node goes down, there will be a brief timeout period (configurable, 10 seconds by default I believe) until the failure is detected by the client. It should then mark that node as dead and continue operations with the remaining servers. The client will continue to periodically poll that dead server and will add it back into its pool when it can connect to it again.
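The behavior described above (mark a node dead after a timeout, carry on with the rest, re-check the dead node periodically) can be sketched roughly as follows. This is an illustration in Python, not the Enyim client's actual implementation: NodePool, probe, retry_interval, and clock are all hypothetical names, and the 10-second interval simply mirrors the figure mentioned above.

```python
import time
from typing import Callable, Dict, List


class NodePool:
    """Tracks which nodes are considered alive (illustrative sketch only)."""

    def __init__(self, nodes: List[str], probe: Callable[[str], bool],
                 retry_interval: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe                  # returns True if the node answers
        self._retry_interval = retry_interval
        self._clock = clock
        self._alive: List[str] = list(nodes)
        self._dead: Dict[str, float] = {}    # node -> time of the last failed check

    def mark_dead(self, node: str) -> None:
        """Call this when an operation against `node` times out."""
        if node in self._alive:
            self._alive.remove(node)
            self._dead[node] = self._clock()

    def revive_recovered(self) -> None:
        """Re-probe dead nodes whose retry interval has elapsed."""
        now = self._clock()
        for node, last_check in list(self._dead.items()):
            if now - last_check < self._retry_interval:
                continue                     # too soon to check this node again
            if self._probe(node):
                del self._dead[node]
                self._alive.append(node)     # node is back: return it to the pool
            else:
                self._dead[node] = now       # still down: wait another interval

    def alive_nodes(self) -> List[str]:
        """Nodes that operations should currently be routed to."""
        self.revive_recovered()
        return list(self._alive)
```

In this sketch the application never has to re-fetch the node list by hand: a timeout calls mark_dead, and the next call to alive_nodes after the retry interval brings a recovered node back.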
In the interest of full transparency, we have also seen some server-side issues and have already fixed them in the soon-to-be-released 1.6.0 version.
Thanks again, please let me know what else I can do to help.
Perry