Node Flap ...

Gentlefolk,

I'm trying to bring up a 3-node cluster on AWS. I am using the commercial CB v2.2.0 Enterprise Standard AWS AMIs authored by CB. These are being instantiated on m1.xlarge instances using local ephemeral storage. The data and indices are on separate local spindles. Replicas are being maintained. In other words, this is a pretty vanilla cluster that I'm trying to bring into a production validation process. There is a fourth node, the Python client SDK node, running on top of Ubuntu 13.10. I have top running on it and it seems to maintain good connectivity back to my office. The client process is taking, disappointingly, less than 50% of a single core. (On my local test infrastructure, on smaller machines, the local client maxes out a single core.)

Everything appears to be operating correctly … except when it doesn't.

The cluster successfully took a 33 million document/30 GB cbrestore. When I try to use the same script that created the aforementioned DB and add new documents, one of the three nodes flaps into and out of the cluster. While this is a great test of my idempotent document insertion strategy, it is an extreme PITA. My threaded client handles the problems pretty well but the timeouts really hit insertion performance.

As an AWS newbie, my questions are:

1) How should I have launched these images to minimize cluster flap? Is there a command or zone I can specify in the launch process? (All three nodes are in the same availability zone.)

2) Is there a better image to use? I am following the advice from Amazon in their paper describing how to use CB on their cloud.

3) It has kind of repaired itself but I'm still seeing flap "glitches" in the console display. Where would I look in the logs to see some more detail as to what is going on?

Thank you in advance for any insight you care to share.

Anon,
Andrew

1 Answer


Andrew, sorry to hear you're having troubles. We'd have to take a look deeper into the logs to know for sure what's going on, but a few points to check:

-We've recently seen some issues with THP in various environments. Can you check that your choice of operating system has this turned off (cat /sys/kernel/mm/transparent_hugepage/enabled)?
-It sounds like you may be approaching some sizing limits...you're putting 75GB into only ~36GB of RAM. While it shouldn't be a problem on paper, it will depend on how large your items are and how quickly you're trying to insert them. Can you share a screenshot of the "summary" graphs on your bucket in the hour/day timeframe? I wouldn't say that this behavior is expected when undersized, but it may be making it worse.
-On your application side, how frequently are you creating Couchbase client objects? We would always recommend trying to use a single object for as long as possible, and you may also want to look at turning on the "config_cache" (http://www.couchbase.com/wiki/display/couchbase/libcouchbase+configurati...) which will reduce the traffic back to the cluster.
-Any views configured? Can you share the definitions for them?
-Any XDCR?

Node flapping is usually a sign that the cluster managers are having problems communicating with each other. Assuming these nodes are installed in the same AWS region, there "shouldn't" be any network problems so the other likely culprits are disk IO and/or over-utilization of the beam.smp processes.

Let's start with the above and see where it goes...

Perry,

Thank you for getting back to me.

-We've recently seen some issues with THP in various environments. Can you check that your choice of operating system has this turned off (cat /sys/kernel/mm/transparent_hugepage/enabled)?

I am using CB's commercial AWS AMI. Settings like huge pages are configured by you. All I did to the image was add a second ephemeral spindle and point the indices there.

While this isn't from the cluster in question, it is from your AMI:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
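
A missing file at that path does not by itself show whether THP is off, so the check can be made a little more forgiving. What follows is a minimal sketch, assuming Python is available on the node. The second path is the location used by RHEL/CentOS 6 kernels; whether it applies to this AMI is an assumption, not something established in this thread. The function names are made up for illustration.

```python
# Candidate sysfs locations. The first is the path quoted above; the second is
# the one RHEL/CentOS 6 kernels use. Whether either exists on a given AMI is an
# assumption to verify on the node itself.
THP_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)


def parse_thp_setting(text):
    """Return the active value from e.g. 'always madvise [never]', or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(paths=THP_PATHS):
    """Return (path, active_value) for the first readable path, else (None, None)."""
    for path in paths:
        try:
            with open(path) as handle:
                return path, parse_thp_setting(handle.read())
        except OSError:
            # Missing or unreadable file: try the next candidate.
            continue
    return None, None


if __name__ == "__main__":
    path, value = read_thp_setting()
    if path is None:
        print("No THP control file found at the known paths")
    else:
        print(f"{path}: {value}")
```

The kernel marks the active mode with square brackets, so only the bracketed token matters; a result of "never" means THP is off.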

-It sounds like you may be approaching some sizing limits...you're putting 75GB into only ~36GB of RAM. While it shouldn't be a problem on paper, it will depend on how large your items are and how quickly you're trying to insert them. Can you share a screenshot of the "summary" graphs on your bucket in the hour/day timeframe? I wouldn't say that this behavior is expected when undersized, but it may be making it worse.

It was my understanding that the key thing was to keep the metadata below half of the configured RAM and to only use 60% of the machine's RAM for CB. By my observation/calculations, I am using about 75 bytes/item for metadata for 2.25 GB total. The cluster has 4.5 GB/node for CB for 13.5 GB total. 2.25GB * 2 for the replica leaves me with 9 GB RAM for the rest of the machine. By my calculation, I can stick another 15M or so documents into this cluster. But this problem is showing up long before we get there.
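
The arithmetic above can be written out as a short sketch. It uses only the figures quoted in this post, and it assumes decimal gigabytes and a single replica (two copies of the metadata); the variable names are illustrative.

```python
GB = 10**9  # decimal gigabytes: an assumption, the post does not say GB vs GiB

nodes = 3
quota_per_node = 45 * GB // 10     # 4.5 GB per node for CB (from the post)
metadata_active = 225 * GB // 100  # 2.25 GB of metadata (from the post)
bytes_per_item = 75                # observed metadata per item (from the post)
copies = 2                         # active copy plus one replica (assumed)

cluster_quota = nodes * quota_per_node     # 13.5 GB
metadata_total = metadata_active * copies  # 4.5 GB including the replica
metadata_ceiling = cluster_quota // 2      # "below half of the configured RAM"
headroom_bytes = metadata_ceiling - metadata_total
headroom_items = headroom_bytes // (bytes_per_item * copies)

if __name__ == "__main__":
    print(f"cluster quota:     {cluster_quota / GB:.2f} GB")
    print(f"metadata in use:   {metadata_total / GB:.2f} GB")
    print(f"metadata ceiling:  {metadata_ceiling / GB:.2f} GB")
    print(f"room for roughly   {headroom_items:,} more documents")
```

Under those assumptions the headroom works out to 15 million documents, which matches the estimate above. Taking all 33 million restored documents at 75 bytes each instead puts the metadata nearer 2.5 GB per copy and trims the headroom to about 12 million.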

My particular app mostly uses views. I don't, at this time, depend upon keeping documents in memcache. It is a mostly idempotent data application. If I could dedicate more of the machine towards metadata, I would.

As I've torn the problematic cluster down and moved up the server size chart, I cannot make you a screenshot. Tomorrow, I'll be spinning up a second node of the m2.4xlarge servers. Then I can make you a picture.

-On your application side, how frequently are you creating Couchbase client objects? We would always recommend trying to use a single object for as long as possible, and you may also want to look at turning on the "config_cache" (http://www.couchbase.com/wiki/display/couchbase/libcouchbase+configurati...) which will reduce the traffic back to the cluster.

My Python client spins up a connection object per thread and keeps them open for the life of the client. I have 7 threads operating. I move hundreds of thousands of documents through each of those connections. It has been stable. I log each timeout. They are rare except when using this cluster.
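
The one-connection-per-thread arrangement described above might look roughly like the sketch below. It is not the actual client code: `StubConnection` is a hypothetical stand-in for a real SDK client object, used so the example runs without a cluster, and the other names are illustrative too.

```python
import threading


class StubConnection:
    """Stand-in for a real client object; hypothetical, not an SDK class."""

    created = 0
    _lock = threading.Lock()

    def __init__(self):
        # Count constructions so the reuse can be observed.
        with StubConnection._lock:
            StubConnection.created += 1

    def set(self, key, value):
        return True


_local = threading.local()


def get_connection():
    """Return this thread's connection, creating it on first use only."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = _local.conn = StubConnection()
    return conn


def worker(worker_id, n_docs):
    # Every write goes through the same per-thread connection.
    for i in range(n_docs):
        get_connection().set(f"doc-{worker_id}-{i}", {"n": i})


def run(n_threads=7, n_docs=1000):
    """Run the workers and return how many connections were created."""
    before = StubConnection.created
    threads = [
        threading.Thread(target=worker, args=(t, n_docs)) for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return StubConnection.created - before


if __name__ == "__main__":
    print(run())
```

Seven threads produce seven connections however many documents each thread writes, which is the behaviour Perry's question was probing for.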

-Any views configured? Can you share the definitions for them?

I have about a dozen views configured in three design documents. They are pretty straightforward views. I would be happy to share them privately with you.

-Any XDCR?

None.

I'm happy to add more data as I have it.

Andrew

Thanks Andrew. I hear you've been in touch with some of my other colleagues (Travis?) so if you could please send those view definitions to him we can take a look at them. He can also get you in touch with someone from my team (I'm based in London, our timezones won't exactly line up) to gather logs and dig in a bit deeper.

I agree with your responses to my previous comments, I was just throwing out what I could think of off the top of my head.

When you say that you're heavily using views...how frequently are you trying to query them and if you stop that, does the cluster stay stable?

Perry,

I'll certainly be in touch with Travis and will share the design documents with him.

That said, we are in the data ingestion phase of this project. The views are being calculated, but no queries are being run against the cluster yet. IOW, unless index creation has cross-node dependencies (a truly surprising discovery, if true), I doubt you have a view issue. (I have tested the views against my local systems. I know that they meet my needs.)

I'll ping Travis when I demonstrate, or hopefully not, the same problem with the larger cluster. Then we can get logs.

In the meantime, I've modified my ingestion scripts to depend upon optimistic locking more and they are running about twice as fast. We'll see if they scale horizontally, as they should, when I add another node.

Andrew

Yes, given your latest comments I would tend to agree...but it might still be interesting to see what it looks like if those are turned off, even as a diagnostic step.

Can you describe a bit more how you were doing the ingestion without optimistic locking?

Perry,

I was still using optimistic locking. Just using it poorly. Old patterns from more traditional DB techniques had infected my code (de-duplication checking, primarily). In this app, I have some strong presumptions of immutability of the data once it is written. Hence, I could dispense with checking and just write the damn item. By and large this removed two extra round trips per insertion. Insertion performance has increased.
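
As a rough illustration of the difference between checking first and simply writing, here is a minimal sketch against an in-memory stand-in. Everything in it is hypothetical: `ToyStore` is not a Couchbase API, the content-derived key is an assumed way of making writes idempotent, and the thread does not say which two calls were actually dropped (the checked version here shows only one extra lookup).

```python
import hashlib
import json


class ToyStore:
    """In-memory stand-in for a bucket that counts round trips. Hypothetical."""

    def __init__(self):
        self.items = {}
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.items.get(key)

    def set(self, key, value):
        self.round_trips += 1
        self.items[key] = value


def doc_key(doc):
    """Derive the key from the content so the same document maps to the same key."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def insert_checked(store, doc):
    """Old habit: look before writing (a de-duplication check)."""
    key = doc_key(doc)
    if store.get(key) is None:
        store.set(key, doc)


def insert_blind(store, doc):
    """Immutable data: just write; a repeated write stores the same content."""
    store.set(doc_key(doc), doc)


def ingest(insert, docs):
    """Feed every document through the given insert function into a fresh store."""
    store = ToyStore()
    for doc in docs:
        insert(store, doc)
    return store


if __name__ == "__main__":
    docs = [{"id": i, "payload": "x"} for i in range(3)]
    for insert in (insert_checked, insert_blind):
        store = ingest(insert, docs + docs)  # the second pass simulates retries
        print(insert.__name__, len(store.items), store.round_trips)
```

Both approaches end with the same stored items even when every document is sent twice, which is what makes retrying after a timeout safe; the blind write just gets there with fewer trips to the cluster.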

But my ingestor client is still running significantly slower than it does on my local test nodes. There is much more performance to be had. We'll see when the backup of the existing data finishes and the rebalance is done with the new node.

Andrew