Server marked as unhealthy after adding an additional server to a cluster

Thu, 11/24/2011 - 10:41
Conor

I'm seeing intermittent problems when building a cluster of Membase servers. Sometimes adding an additional server to a cluster will cause another server to be marked as unhealthy. This only happens sometimes, so it's tricky to nail down properly. The best clue I have so far is an entry like this in the logs:

Conflicting configuration changes to field nodes_wanted:[{'_vclock',[{'ns_1@10.0.0.3',{1,63489371446}}]},'ns_1@10.0.0.3'] and [{'_vclock',[{'ns_1@10.0.0.1',{3,63489371446}}]}, 'ns_1@10.0.0.1','ns_1@10.0.0.3','ns_1@10.0.0.2'], choosing the former.

It looks like some sort of a tie-break scenario and sometimes it picks the wrong nodes to include in the configuration.

Does anyone have any good ideas as to what might be causing this?

Detailed steps to reproduce are below:

Steps to reproduce

Create an AMI

  • Spin up an instance with an Ubuntu 11.10 64 bit EBS image.
  • Shell into the machine and run the following commands:
  • sudo byobu-disable
  • (Shell in again)
  • sudo aptitude update
  • sudo aptitude safe-upgrade
  • wget http://packages.couchbase.com/releases/1.7.2/membase-server-community_x8...
  • sudo aptitude install libssl0.9.8
  • sudo dpkg -i membase-server-community_x86_64_1.7.2.deb
  • Create an AMI from that image.

Create the Security Group

Create a security group with the following allow rules:

  • TCP ports 1-65535 from this security group.
  • UDP ports 1-65535 from this security group.
  • ICMP All from this security group.
  • TCP port 8091 from 0.0.0.0/0

Launch the Membase machines

Launch 3 m2.xlarge instances with the AMI and Security Group you created above.

For the rest of this document public1, public2 and public3 are the AWS public hostnames for the instances. 10.0.0.1, 10.0.0.2 and 10.0.0.3 are the private IPs.

Configure the Membase cluster

  • Go to http://public1:8091/
  • Click SETUP
  • Leave the Path field as-is.
  • Make sure "Start a new cluster" is selected.
  • Lower the "Per Server RAM Quota" to 15000.
  • Click Next
  • "Bucket Type" == "Membase"
  • "Per Node RAM Quota" == 100MB (the minimum)
  • "Enable Replication" -> checked
  • "Number of replica (backup copies)" -> 1
  • Click Next
  • Click Next
  • "Username" -> "Administrator"
  • "Password" -> "test1234"
  • Click Next

Start adding machines

In the Membase Web Console, add the first machine:

  • Click on "Server Nodes" under "Manage"
  • Click "Add Server"
  • "Server IP Address" -> 10.0.0.2
  • "Username" -> "Administrator"
  • "Password" -> "test1234"
  • Click "Add Server"
  • Click "Add Server"
  • Click "Rebalance"
  • Wait for the rebalance to complete.

Add the second machine:

  • Click "Add Server"
  • "Server IP Address" -> 10.0.0.3
  • "Username" -> "Administrator"
  • "Password" -> "test1234"
  • Click "Add Server"
  • Click "Add Server"

Observe the failure

On some occasions the second additional machine (10.0.0.3) will fail to add cleanly. In this circumstance, you'll see 10.0.0.2 marked as "Down" in the server list and 10.0.0.3 listed under "Pending Rebalance".

I believe this particular log entry is the root cause (or at least a symptom of it):

    Conflicting configuration changes to field nodes_wanted:[{'_vclock',[{'ns_1@10.0.0.3',{1,63489371446}}]},'ns_1@10.0.0.3'] and [{'_vclock',[{'ns_1@10.0.0.1',{3,63489371446}}]}, 'ns_1@10.0.0.1','ns_1@10.0.0.3','ns_1@10.0.0.2'], choosing the former.
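To make the two competing values in that log entry easier to compare, here is a minimal Python sketch that pulls the node lists out of the quoted line. The helper name `split_candidates` and the parsing approach are illustrative assumptions based only on the one log line shown in this post; they are not part of Membase, and other log lines may be shaped differently.

```python
import re

# The log line quoted in the post, reproduced verbatim.
LOG_LINE = (
    "Conflicting configuration changes to field nodes_wanted:"
    "[{'_vclock',[{'ns_1@10.0.0.3',{1,63489371446}}]},'ns_1@10.0.0.3'] and "
    "[{'_vclock',[{'ns_1@10.0.0.1',{3,63489371446}}]}, "
    "'ns_1@10.0.0.1','ns_1@10.0.0.3','ns_1@10.0.0.2'], choosing the former."
)

# Matches quoted node names such as 'ns_1@10.0.0.1'.
NODE = re.compile(r"'(ns_1@[\d.]+)'")


def split_candidates(line):
    """Return (former, latter) node lists from a nodes_wanted conflict line.

    Hypothetical helper: it assumes the exact layout of the line above.
    Node names inside the '_vclock' metadata are skipped.
    """
    body = line.split("nodes_wanted:", 1)[1]
    former_text, latter_text = body.split("] and [", 1)
    candidates = []
    for text in (former_text, latter_text):
        # The {'_vclock', [...]} prefix ends at the first "]}".
        values = text.split("]}", 1)[1]
        candidates.append(NODE.findall(values))
    return tuple(candidates)


former, latter = split_candidates(LOG_LINE)
print("chosen (former):", former)
print("discarded (latter):", latter)
print("nodes missing from the chosen value:", sorted(set(latter) - set(former)))
```

For this line, the chosen value lists only 10.0.0.3, while the discarded value lists all three nodes.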

    Fri, 11/25/2011 - 13:29
    ingenthr

    Is this some kind of automated adding/removing of nodes?

    The cluster configuration is spread out across all nodes. To be able to determine the latest changes, we use vector clocks on the configuration. Without getting into all of the details, this should allow you to make changes on multiple nodes, even concurrently. If the same part of the config is changed in two locations at the same time though, there can be a conflict and the cluster will try to resolve the conflict.
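As a rough illustration of the comparison described above, the sketch below applies the textbook vector clock rule to the two `_vclock` counters from the quoted log line. This is an assumption-laden sketch, not Membase's actual implementation: the function names are hypothetical, the timestamps that appear next to the counters in the log are ignored, and the tie-break the server applies after detecting a conflict is not modelled.

```python
def descends(a, b):
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two vector clocks (dicts mapping node name to counter)."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    # Neither clock contains the other's history: the changes were made
    # independently, so there is no "latest" value to pick automatically.
    return "concurrent"


# Counters copied from the '_vclock' entries in the log line (timestamps omitted).
former = {"ns_1@10.0.0.3": 1}
latter = {"ns_1@10.0.0.1": 3}

print(compare(former, latter))
```

Under this rule the two clocks in the log are concurrent: each records updates from a node the other has never heard from, which is the kind of situation that would be reported as a conflict.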

    So, question is, are you making lots of changes to the configuration concurrently and from multiple nodes? This is the only thing I can think of at the moment that'd cause the scenario you're describing. I'll see if I can get some other input on the info you've posted.

    Thanks for all of the detailed info.

    Fri, 11/25/2011 - 17:21
    Conor

    Hi, thanks for the response.

    We have an automated system for adding and removing nodes but for the purposes of isolating this issue I have replicated it by hand using the steps above.

    Even when doing an automated build of a cluster, we only ever make changes to one node at a time. After we started seeing issues, I also inserted sleeps between API calls to see if I could avoid timing issues.

    If it's any further help, we were seeing this issue after a call to /controller/addNode in our automated system. The issue still occurs when doing things manually though.
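For readers who want to reproduce the automated path, here is a minimal Python sketch of how a call to /controller/addNode might be built. The thread names only the endpoint. The form fields (`hostname`, `user`, `password`), the use of HTTP Basic auth, and the reuse of one set of credentials for both are assumptions drawn from the REST API as I understand it, and `build_add_node_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_add_node_request(cluster_host, new_node_ip, username, password):
    """Build (but do not send) a POST to /controller/addNode.

    Assumptions: port 8091 as in the steps above, form-encoded fields named
    hostname/user/password, and Basic auth with the cluster credentials.
    """
    body = urlencode(
        {"hostname": new_node_ip, "user": username, "password": password}
    ).encode("ascii")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"http://{cluster_host}:8091/controller/addNode",
        data=body,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


# Values taken from the reproduction steps in the first post.
req = build_add_node_request("public1", "10.0.0.3", "Administrator", "test1234")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing the result to `urllib.request.urlopen(req)` would send it. Issuing one such request at a time, against a single node, matches the one-change-at-a-time approach described above.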

    Fri, 11/25/2011 - 17:33
    ingenthr

    Yes, thanks for all of the additional information.

    I've checked around a bit and the initial reaction is that the message you observe there is benign. Something else may be leading to it though. Can you get diags from all of the nodes and either put them somewhere for us to download or attach them to an issue on www.couchbase.org/issues? (note: same login works there as for the forums)

    Fri, 11/25/2011 - 18:43
    Conor

    I re-ran the test and got an apparently related problem (slightly different symptoms). The details are filed in this issue: http://www.couchbase.org/issues/browse/MB-4476

    Thanks again for your help.

    Fri, 11/25/2011 - 22:30
    ingenthr

    Thank you for all the diligence on this one. I've already been in contact with some folks and passed along the issue. We should both see updates from there.

    Sat, 11/26/2011 - 04:03
    Conor

    Excellent, thanks very much. :)
