[MB-4476] adding a node to a cluster which has the same otpCookie causes issues ( happens when vm is cloned or created from an AMI/VM_Template where couchbase was already installed) Created: 25/Nov/11  Updated: 31/Jan/14  Resolved: 05/Apr/12

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.7.2
Fix Version/s: 2.0-beta
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Conor McDermottroe Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: 1.8.1-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Membase 1.7.2 installed via deb on Ubuntu 11.10 64 bit running on an Amazon EC2 m2.xlarge instance.

Attachments: File membase_a_log.json     Zip Archive membase_a_mbcollect_info.zip     Text File membase_a_ns-diag-20111130185447.txt     File membase_b_log.json     Zip Archive membase_b_mbcollect_info.zip     Text File membase_b_ns-diag-20111130185430.txt     File membase_c_log.json    

 Description   
I'm seeing intermittent problems when building a cluster of Membase servers. Sometimes adding an additional server to a cluster will cause another server to be marked as unhealthy. Other times the additional server will function correctly but all the others will appear to be in setup mode. This only happens sometimes, so it's tricky to nail down properly.

Steps to reproduce this are in this forum thread: http://www.couchbase.org/forums/thread/server-marked-unhealthy-after-adding-additional-server-cluster

In the most recent test run, I spun up 3 m2.xlarge instances as described in the forum post. On adding the third server to the cluster (from the admin web interface on the first server), the admin interface refreshed and showed the setup dialog. When I browsed to the second server it was also in setup mode. The third server was configured correctly and appeared to be running in its own cluster of one machine. I've attached the output from /logs on all three machines.

(Forgive me if I have the wrong component selected, I'm not quite sure which one is causing me to see this issue.)

 Comments   
Comment by Farshid Ghods (Inactive) [ 30/Nov/11 ]
If possible, please attach the diags from the existing node and the node you added to the cluster.
Comment by Conor McDermottroe [ 30/Nov/11 ]
Sorry, I should have described the attachments a little better.

membase_a_log.json is the output from /log on the machine where I ran the setup via the web console.

membase_b_log.json is the output from /log on the first machine which I added to the cluster. (After which the cluster was a functioning two machine cluster)

membase_c_log.json is the output from /log on the second machine which I added to the cluster. (After which the cluster was broken)

Are there other diagnostics which would be useful? I can re-run the test if necessary.
Comment by Conor McDermottroe [ 30/Nov/11 ]
I've replicated the issue again, this time with two machines.

I ran setup on A, which resulted in an apparently OK 1 node "cluster".

I then added B to A which resulted in A being in setup mode and B being in a 1 node "cluster".

I've attached the output of /diag and mbcollect_info from both machines.
Comment by Conor McDermottroe [ 30/Nov/11 ]
Oh, and sorry for the naming. Just in case it's not 100% clear, the A and B in the second test with the output of /diag and mbcollect_info are *not* the same as the A and B in the first test.
Comment by Conor McDermottroe [ 06/Dec/11 ]
Was the information I attached above any use? I can re-run and gather additional information if you need.
Comment by Aleksey Kondratenko [ 06/Dec/11 ]
It is useful. Thanks a lot for taking the time to gather it and create the ticket.
Comment by Matt Ingenthron [ 08/Dec/11 ]
Adding to 1.8.0 fixfor. Per discussion today, QE will look into this and determine whether it still belongs slotted on 1.8, and triage priority/severity.
Comment by Aleksey Kondratenko [ 09/Dec/11 ]
Found what happened. Thanks again, Conor, very much for reporting it.

Something interesting happened. Both nodes have the same initial Erlang cookie, and that causes them to communicate too early in the join process. The node that is joining gets the config from the node being joined, sees a config conflict, and picks the 'wrong' version. That results in a nodes_wanted list containing only the new (just-joined) node, which causes the original cluster node to leave the cluster.
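A quick way to confirm this on live nodes, assuming the administrator credentials and host:port below are placeholders to be replaced with your own, is to ask each node for its current cookie through the same /diag/eval endpoint used for the workaround further down:

wget -O- --post-data='erlang:get_cookie().' --user=Administrator --password=asdasd http://lh:9000/diag/eval

erlang:get_cookie/0 is a standard Erlang BIF; if two not-yet-joined nodes print the same atom, they share the initial cookie described above.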

This is very interesting and we haven't seen it before. Have you cloned the VM? It looks like this is possible in EC2 via custom images, and the logs kind of confirm that: there is a time jump of 8 days before the last start of the node.

If not, it could be due to the nodes being launched at the same time combined with the low clock resolution of EC2, because the initial cookie is generated by an RNG and the RNG is seeded with the clock. Erlang itself has microsecond clock precision, but the underlying kernel (and, in the case of Xen, the underlying hypervisor or Dom0 kernel) does not necessarily support that. But that seems _very_ unlikely, so I bet on cloning.
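For what it's worth, a rough illustration of that seeding concern (a hypothetical sketch, not the actual ns_node_disco:cookie_gen/0 code): two nodes that seed the RNG with an identical timestamp derive an identical cookie.

seed_and_gen({A, B, C}) ->
    %% seed the process RNG with a clock value, then build a 16-char atom
    random:seed(A, B, C),
    list_to_atom([$a + random:uniform(26) - 1 || _ <- lists:seq(1, 16)]).

Calling seed_and_gen/1 with the same {MegaSecs, Secs, MicroSecs} tuple on two nodes returns the same atom.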

In order to fix this issue correctly, I need you to confirm whether or not you cloned your VM.

Meanwhile, the following command can be used to re-initialize a node's cookie (don't do that on nodes that are already joined to a cluster):

wget -O- --post-data='NewCookie = ns_node_disco:cookie_gen(), ns_config:set(otp, [{cookie, NewCookie}]).' --user=Administrator --password=asdasd http://lh:9000/diag/eval

Replace the password with your admin password and host:port with your REST host:port (8091 is the default port). Doing this on any of the nodes prior to joining will likely fix your problem.
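For example, against a node listening on the default REST port (the host name and password here are placeholders, not values taken from this report):

wget -O- --post-data='NewCookie = ns_node_disco:cookie_gen(), ns_config:set(otp, [{cookie, NewCookie}]).' --user=Administrator --password=yourpassword http://node-b.example.com:8091/diag/eval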
Comment by Conor McDermottroe [ 09/Dec/11 ]
Thanks all for chasing this down.

I created the AMI from a running instance after installing Membase from the deb, so I guess that's the issue.

I'm going to test re-initializing the cookie after launch but before adding it to the cluster and see if that fixes the issue. I'll report back here with the results.
Comment by Conor McDermottroe [ 09/Dec/11 ]
If I re-initialize the Erlang cookie before adding a machine to the cluster I can't replicate the error.

Looks good so far, thanks!
Comment by Aleksey Kondratenko [ 07/Feb/12 ]
A good one for learning our clustering.

Here we need to change the node's uuid and cookie prior to joining. /engage.. seems like the right place.

This is against 1.8.
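A minimal sketch of the approach described above (hypothetical shape only; the actual change landed in src/ns_cluster.erl via the review linked below, and the function name here is illustrative):

%% Regenerate the otp cookie right before the join handshake so that a
%% cookie inherited from a cloned VM image can never match an existing one.
reset_cookie_before_join() ->
    NewCookie = ns_node_disco:cookie_gen(),
    ns_config:set(otp, [{cookie, NewCookie}]),
    erlang:set_cookie(node(), NewCookie),
    ok.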
Comment by Aleksey Kondratenko [ 20/Mar/12 ]
http://review.couchbase.org/14062
Comment by Thuan Nguyen [ 22/Mar/12 ]
Integrated in github-ns-server-2-0 #321 (See [http://qa.hq.northscale.net/job/github-ns-server-2-0/321/])
    reset node's cookie before joining cluster.MB-4476 (Revision 9fbf7871720192010e65e1b5610dece9383a0f30)

     Result = SUCCESS
Aliaksey Kandratsenka :
Files :
* src/ns_cluster.erl
Comment by Aleksey Kondratenko [ 05/Apr/12 ]
Actually fixed for 1.8.1 here: http://review.couchbase.org/14186
Comment by Aleksey Kondratenko [ 05/Apr/12 ]
done
Generated at Thu Jul 10 12:48:13 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.