Cluster stuck in unrecoverable state after server failures
We are using beta4 running on an EC2 CentOS 5.4 x64 server.
The test scenario is that we have four membase servers in a cluster running correctly and we manually terminate 1-3 of them, via the AWS console.
We are now stuck in a state where the remaining servers have correctly identified that some of the other servers are down, but we are unable to do anything about it. Clicking Fail Over shows a loading spinner for approximately 5 seconds before the page reloads. The log shows the following error:
Server error during processing: ["web request failed",
{path,"/controller/failOver"},
{type,exit},
{what,
{{{nodedown,'ns_1@10.223.62.182'},
{gen_server,call,
[{'ns_memcached-default',
'ns_1@10.223.62.182'},
{set_vbucket,1,pending},
30000]}},
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
{failover,'ns_1@10.223.62.106'},
20000]}}},
{trace,
[{gen_fsm,sync_send_event,3},
{ns_cluster_membership,failover,1},
{menelaus_web,handle_failover,1},
{menelaus_web,loop,3},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]}] (repeated 1 times)
Clicking remove servers puts them in pending rebalance, but clicking rebalance also results in failure with the following error in the logs:
Rebalance exited with reason noconnection
(repeated 1 times) ns_orchestrator002 20:47:05 - Tue Oct 5, 2010
Client-side error-report for user "Administrator" on node 'ns_1@10.122.10.120':
User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Got unhandled error: 'undefined' is null or not an object
At: http://174.129.106.43:8080/js/all.js:6238
Backtrace:
Function: collectBacktraceViaCaller
Args:
I have tried with replication both enabled and disabled for the bucket but cannot recover from this state. This is a fairly serious problem for us, as we cannot have the entire cluster fail because of a single machine failure.
Any ideas? Are we doing something completely wrong here? Thanks in advance.
Thanks for the quick response Perry.
Is my understanding correct that if a server fails catastrophically before an admin is able to fail it over or remove it, the data bucket is then unrecoverable because it was striped across the now-dead servers?
No, I don't think that's the correct understanding.
Failover is meant precisely for the case where a node fails catastrophically: it immediately activates the appropriate replicas for the data that is now unavailable. Removing a node, by contrast, is designed to be done while the node is still online (for planned maintenance).
You are correct that the specific data that was on the failed node is now unavailable, but the replica copies of that data are still very much available. When the node comes back into the cluster, it is these replicas that are used to rematerialize the data so as to maintain consistency. You can increase the replica count (with the understanding that this takes more RAM and disk space) in order to sustain the loss of more servers.
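To make the failover idea above concrete, here is a minimal toy model in Python. It is only a sketch of the general technique (promote a surviving replica when the active copy is lost), not Membase's actual implementation: the `VBucket` class, `build_cluster`, `fail_over`, and the round-robin placement are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VBucket:
    """One data partition in the toy model (hypothetical, not Membase internals)."""
    active: Optional[str]                               # node serving this partition
    replicas: List[str] = field(default_factory=list)   # nodes holding replica copies


def build_cluster(nodes: List[str], num_vbuckets: int, num_replicas: int) -> Dict[int, VBucket]:
    """Place partitions round-robin; replicas go on the next nodes in the ring.

    Assumes num_replicas < len(nodes) so every copy lands on a distinct node.
    """
    vbuckets = {}
    for i in range(num_vbuckets):
        active = nodes[i % len(nodes)]
        replicas = [nodes[(i + k) % len(nodes)] for k in range(1, num_replicas + 1)]
        vbuckets[i] = VBucket(active, replicas)
    return vbuckets


def fail_over(vbuckets: Dict[int, VBucket], dead_node: str) -> List[int]:
    """Promote a surviving replica wherever the active copy lived on dead_node.

    Returns the ids of partitions that have no surviving copy left.
    """
    lost = []
    for vb_id, vb in vbuckets.items():
        # The dead node can no longer hold replicas either.
        vb.replicas = [n for n in vb.replicas if n != dead_node]
        if vb.active == dead_node:
            if vb.replicas:
                vb.active = vb.replicas.pop(0)   # promote the first surviving replica
            else:
                vb.active = None                 # no copy left: data unavailable
                lost.append(vb_id)
    return lost


if __name__ == "__main__":
    cluster = build_cluster(["a", "b", "c", "d"], num_vbuckets=8, num_replicas=1)
    print("lost after first failure:", fail_over(cluster, "a"))
    print("lost after second failure:", fail_over(cluster, "b"))
```

In this toy layout with one replica, losing a single node leaves every partition served by a promoted replica, while losing a second node before replicas are rebuilt can leave some partitions with no copy at all. That is the reason a higher replica count lets a cluster sustain the loss of more servers.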
Hope that clears it up for you, I'll let you know what I find about the specific (and unexpected) issue you ran into.
Perry
Thanks Perry, that's good to know. I am able to reproduce the above with 100% consistency, which had me very worried; I am glad to hear it is just a bug.
What is the best practice for determining the number of replicas when initially creating a bucket? In beta4 I am unable to change the number of replicas after initial bucket creation, which makes life a little challenging when dealing with a potentially ever-increasing number of nodes in a cluster.
Yes, it is clearly a bug and will be fixed going forward (you can see the bug report here: http://bugs.northscale.com/show_bug.cgi?id=2689).
In terms of judging the number of replicas needed, you'll have to balance a few characteristics and decide what is most important for you:
- How likely are your nodes to fail? And how likely is it that more than one will fail before an administrator can take some action? Obviously this differs between EC2, a physical datacenter, a virtualized datacenter, and so on.
- How much memory and disk space are you willing to use? More replicas mean more RAM and disk needed.
In the end: How sensitive are you to the loss of certain pieces of data? I can cause significant or even total data loss with any datastore (Oracle, MySQL, a hard drive, even RAID, etc.) if I do enough things to destroy the data and its replicas. No system is 100% safe from extreme failures, so everyone needs to balance the mission criticality of their data against the additional expense of mitigating that risk. We give you the ability to control the level of durability on a per-dataset basis so that you can make the appropriate judgement and tradeoffs for different types of data.
I know that's probably a bit more vague than you were looking for, but I don't want to give you a hard answer since it really does depend.
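The tradeoff described above can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a sizing tool: the function name `replica_tradeoff` and its parameters are hypothetical, it assumes data is spread evenly across nodes, and it ignores per-item metadata and other overhead. It relies only on the general observation that N replicas mean N + 1 copies of each item.

```python
def replica_tradeoff(data_gb: float, num_replicas: int, num_nodes: int) -> dict:
    """Estimate storage cost and failure tolerance for a given replica count.

    Assumptions (illustrative only): even data distribution, no metadata
    overhead, and every copy of an item lives on a different node.
    """
    if num_replicas < 0 or num_nodes < 1:
        raise ValueError("num_replicas must be >= 0 and num_nodes >= 1")
    if num_replicas >= num_nodes:
        raise ValueError("each copy needs its own node: num_replicas must be < num_nodes")

    copies = num_replicas + 1            # one active copy plus the replicas
    total_gb = data_gb * copies          # cluster-wide RAM/disk footprint
    return {
        "copies": copies,
        "total_gb": total_gb,
        "per_node_gb": total_gb / num_nodes,
        # With N replicas, up to N nodes can be lost at once (before any
        # failover or rebalance) while at least one copy of every item survives.
        "tolerated_simultaneous_failures": num_replicas,
    }


if __name__ == "__main__":
    for replicas in (0, 1, 2, 3):
        print(replicas, replica_tradeoff(data_gb=100, num_replicas=replicas, num_nodes=4))
```

For example, under these assumptions 100 GB of data with one replica on four nodes comes to 200 GB in total, or 50 GB per node, and survives one node loss at a time; each additional replica adds another 100 GB cluster-wide.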
Thanks again
Perry
Thanks Perry. Please let me know if I can assist in testing a new build.
I was able to reproduce your issue on the currently released beta 4, but it worked properly in our latest build. Our GA will be released next week and will obviously include this fix.
Thanks so much for your valuable feedback, let me know if there is anything else I can do for you.
Perry
Thanks for the feedback. No, you're not doing anything unusual. I'll try to reproduce this and file a bug if necessary. We've fixed a number of bugs for the upcoming GA release so I will test with both beta 4 and our latest build.
Thanks again for your testing.
Perry