Node Flap ...
I'm trying to bring up a 3-node cluster on AWS. I am using the commercial CB v2.2.0 Enterprise Standard AWS AMI images authored by CB. These are being instantiated on m1.xlarge instances using local ephemeral storage. The data and indices are on separate local spindles. Replicas are being maintained. In other words, this is a pretty vanilla cluster that I'm trying to bring into a production validation process. There is a fourth node, the Python client SDK node running on top of Ubuntu 13.10. I have top running on it and it seems to maintain good connectivity back to my office. The client process is taking, disappointingly, less than 50% of a single core. (On my local test infrastructure, on smaller machines, the local client maxes out a single core.)
Everything appears to be operating correctly … except when it doesn't.
The cluster successfully took a 33 million document/30 GB cbrestore. When I try to use the same script that created the aforementioned DB and add new documents, one of the three nodes flaps into and out of the cluster. While this is a great test of my idempotent document insertion strategy, it is an extreme PITA. My threaded client handles the problems pretty well but the timeouts really hit insertion performance.
As an AWS newbie, my questions are:
1) How should I have launched these images to minimize cluster flap? Is there a command or zone I can specify in the launch process? (All three nodes are in the same availability zone.)
2) Is there a better image to use? I am following the advice from Amazon in their paper describing how to use CB on their cloud.
3) It has kind of repaired itself but I'm still seeing flap "glitches" in the console display. Where would I look in the logs to see some more detail as to what is going on?
Thank you in advance for any insight you care to share.
Andrew, sorry to hear you're having troubles. We'd have to take a deeper look at the logs to know for sure what's going on, but here are a few points to check:
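-We've recently seen some issues with transparent huge pages (THP) in various environments. Can you check that your choice of operating system has this turned off (cat /sys/kernel/mm/transparent_hugepage/enabled)? The active setting is the one shown in square brackets, so "off" means the output shows [never].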
-It sounds like you may be approaching some sizing limits: you're putting 75GB into only ~36GB of RAM. While it shouldn't be a problem on paper, it will depend on how large your items are and how quickly you're trying to insert them. Can you share a screenshot of the "summary" graphs on your bucket in the hour/day timeframe? I wouldn't say that this behavior is expected when undersized, but it may be making it worse.
-On your application side, how frequently are you creating Couchbase client objects? We would always recommend trying to use a single object for as long as possible, and you may also want to look at turning on the "config_cache" (http://www.couchbase.com/wiki/display/couchbase/libcouchbase+configurati...) which will reduce the traffic back to the cluster.
-Any views configured? Can you share the definitions for them?
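For the THP check above, here is a minimal Python sketch that could be run on each node instead of reading the file by eye. It assumes the sysfs path quoted in the first point exists on your image; the function names are made up for illustration, and some distributions expose the setting under a different path.

```python
import re

# Path taken from the THP check above; it may differ on some distributions.
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the active THP mode (the token in square brackets), or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_PATH):
    """Read the sysfs file and return the active mode, or None if unreadable."""
    try:
        with open(path) as handle:
            return parse_thp_mode(handle.read())
    except (IOError, OSError):
        return None


if __name__ == "__main__":
    mode = read_thp_mode()
    if mode is None:
        print("Could not determine the THP setting at %s" % THP_PATH)
    elif mode == "never":
        print("THP is turned off")
    else:
        print("THP is active (mode: %s)" % mode)
```

Running it on all three nodes gives a quick yes/no per node; a result other than "never" means THP is still in effect on that node.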
Node flapping is usually a sign that the cluster managers are having problems communicating with each other. Assuming these nodes are installed in the same AWS region, there "shouldn't" be any network problems so the other likely culprits are disk IO and/or over-utilization of the beam.smp processes.
Let's start with the above and see where it goes...
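On the client-object point above, this is a rough sketch of what "use a single object for as long as possible" can look like in a threaded Python script. It is not tied to the SDK's API: SharedClient and the factory argument are hypothetical names, and you would pass in whatever call your script already uses to connect.

```python
import threading


class SharedClient(object):
    """Lazily create one client and hand the same object to every caller."""

    def __init__(self, factory):
        # factory: a zero-argument callable that opens a connection.
        # It is a placeholder for your existing connect call.
        self._factory = factory
        self._client = None
        self._lock = threading.Lock()

    def get(self):
        # The lock ensures the factory runs only once, even if several
        # threads ask for the client at the same moment.
        with self._lock:
            if self._client is None:
                self._client = self._factory()
            return self._client


if __name__ == "__main__":
    # Stand-in factory so the sketch runs without a cluster.
    shared = SharedClient(lambda: object())
    first = shared.get()
    second = shared.get()
    print("same object reused: %s" % (first is second))
```

Whether one object can safely be shared across threads depends on the SDK version, so treat that as an assumption to verify; if it cannot, the same idea applies with one long-lived client per worker thread rather than one per insert.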