Server marked as unhealthy after adding an additional server to a cluster
I'm seeing intermittent problems when building a cluster of Membase servers. Sometimes adding an additional server to a cluster causes another server to be marked as unhealthy. It doesn't happen every time, so it's tricky to nail down properly. The best clue I have so far is an entry like this in the logs:
Conflicting configuration changes to field nodes_wanted:[{'_vclock',[{'ns_1@10.0.0.3',{1,63489371446}}]},'ns_1@10.0.0.3'] and [{'_vclock',[{'ns_1@10.0.0.1',{3,63489371446}}]}, 'ns_1@10.0.0.1','ns_1@10.0.0.3','ns_1@10.0.0.2'], choosing the former.
It looks like some sort of a tie-break scenario and sometimes it picks the wrong nodes to include in the configuration.
Does anyone have any good ideas as to what might be causing this?
Detailed steps to reproduce are below:
Steps to reproduce
Create an AMI
Create the Security Group
Create a security group with the following allow rules:
Launch the Membase machines
Launch 3 m2.xlarge instances with the AMI and Security Group you created above.
For the rest of this document public1, public2 and public3 are the AWS public hostnames for the instances. 10.0.0.1, 10.0.0.2 and 10.0.0.3 are the private IPs.
Configure the Membase cluster
Start adding machines
In the Membase Web Console, add the first machine:
Add the second machine:
Observe the failure
On some occasions the second additional machine (10.0.0.3) will fail to add cleanly. In this circumstance, you'll see 10.0.0.2 marked as "Down" in the server list and 10.0.0.3 listed under "Pending Rebalance".
I believe this particular log entry is the root cause (or at least a symptom of it):
Conflicting configuration changes to field nodes_wanted:[{'_vclock',[{'ns_1@10.0.0.3',{1,63489371446}}]},'ns_1@10.0.0.3'] and [{'_vclock',[{'ns_1@10.0.0.1',{3,63489371446}}]}, 'ns_1@10.0.0.1','ns_1@10.0.0.3','ns_1@10.0.0.2'], choosing the former.
Hi, thanks for the response.
We have an automated system for adding and removing nodes but for the purposes of isolating this issue I have replicated it by hand using the steps above.
Even when doing an automated build of a cluster, we only ever make changes to one node at a time. After we started seeing issues, I also inserted sleeps between API calls to see if I could avoid timing issues.
If it's any further help, we were seeing this issue after a call to /controller/addNode in our automated system. The issue still occurs when doing things manually though.
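For reference, a one-node-at-a-time sequence like the one described above could be sketched roughly as follows. Only the /controller/addNode endpoint comes from this thread. The port (8091), the form fields (hostname, user, password), the use of HTTP basic auth, the function names and the pause length are all assumptions for illustration, so check them against the REST documentation for your Membase version before relying on them.

```python
import base64
import time
import urllib.parse
import urllib.request


def build_add_node_request(cluster_host, new_node, admin_user, admin_password, port=8091):
    """Build (but do not send) a POST to /controller/addNode.

    Port, form field names and basic auth are assumptions, not confirmed here.
    """
    url = f"http://{cluster_host}:{port}/controller/addNode"
    body = urllib.parse.urlencode(
        {"hostname": new_node, "user": admin_user, "password": admin_password}
    ).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


def add_nodes_one_at_a_time(cluster_host, new_nodes, admin_user, admin_password,
                            pause_seconds=30, send=urllib.request.urlopen,
                            sleep=time.sleep):
    """Add nodes strictly in sequence, always through the same cluster node.

    `send` and `sleep` are injectable so the sequence can be exercised
    without a live cluster.
    """
    for node in new_nodes:
        send(build_add_node_request(cluster_host, node, admin_user, admin_password))
        # Pause so the configuration change can propagate before the next one.
        sleep(pause_seconds)
```

Calling `add_nodes_one_at_a_time("10.0.0.1", ["10.0.0.2", "10.0.0.3"], user, password)` would send both requests through 10.0.0.1, so every configuration change originates from a single node.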
Yes, thanks for all of the additional information.
I've checked around a bit and the initial reaction is that the message you observe there is benign. Something else may be leading to it though. Can you get diags from all of the nodes and either put them somewhere for us to download or attach them to an issue on www.couchbase.org/issues? (note: same login works there as for the forums)
I re-ran the test and got an apparently related problem (slightly different symptoms). The details are filed in this issue: http://www.couchbase.org/issues/browse/MB-4476
Thanks again for your help.
Thank you for all the diligence on this one. I've already been in contact with some folks and passed along the issue. We should both see updates from there.
Excellent, thanks very much. :)
Is this some kind of automated adding/removing of nodes?
The cluster configuration is spread out across all nodes. To be able to determine the latest changes, we use vector clocks on the configuration. Without getting into all of the details, this should allow you to make changes on multiple nodes, even concurrently. If the same part of the config is changed in two locations at the same time though, there can be a conflict and the cluster will try to resolve the conflict.
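To illustrate what a conflict means here, below is a minimal sketch of a generic vector clock comparison. It is not Membase's actual implementation: the function name is made up, and the clocks are simplified to node-to-counter maps, ignoring the timestamp that appears next to each counter in the log entry. The two clocks are the ones from the nodes_wanted message quoted in the original post.

```python
def compare_vclocks(a, b):
    """Compare two vector clocks given as {node: counter} dicts.

    Returns 'equal', 'before' (a happened before b), 'after', or
    'concurrent' (neither clock dominates, i.e. a conflict).
    """
    nodes = set(a) | set(b)
    # A missing entry counts as 0: that node has made no change yet.
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Counters taken from the log entry; timestamps omitted.
former = {"ns_1@10.0.0.3": 1}
latter = {"ns_1@10.0.0.1": 3}

print(compare_vclocks(former, latter))  # concurrent
```

Neither clock contains the other's history, so no ordering can be derived from the clocks alone and some tie-break rule has to pick a winner.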
So the question is: are you making lots of changes to the configuration concurrently and from multiple nodes? That is the only thing I can think of at the moment that would cause the scenario you're describing. I'll see if I can get some other input on the info you've posted.
Thanks for all of the detailed info.