Random spikes in membase cluster response time blocking all threads in nginx/passenger, causing requests to be dropped
(This is a new thread because the title of my earlier thread is no longer relevant, and it's not clear to me that retitling the old thread would correctly carry over all the responses in this forum software.)
I have a web app with 7 c1.mediums running 20 nginx/passenger processes apiece, each of which connects to a membase cluster of 3 m1.small instances.
Periodically the global waiting queue spikes to 60 on every box, which typically means one of the three external dependencies my web app has - MySQL, Membase, or Redis - is blocking or stalled. The end-user effect is that requests are dropped and the app appears unresponsive. Here is a graph showing what I mean.
I wrote some code to monitor the average latency of each of these three components across all six of my front-end machines. I measure: the time to connect to MySQL and run one query; the time to connect to Redis and run a PING request; and the time to connect to memcached (moxi is running locally, FYI) and run a single GET request for an object that I know is not present in the cache.
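For reference, the probe logic is roughly the following (a minimal Python sketch, not my production code; the hostnames, credentials, and key name are placeholders):

import time
import pymysql
import redis
from pymemcache.client.base import Client as MemcacheClient

def timed_ms(fn):
    """Run fn and return elapsed wall-clock time in milliseconds."""
    start = time.monotonic()
    fn()
    return (time.monotonic() - start) * 1000.0

def probe_mysql():
    # Connect and run one trivial query, so the measurement includes
    # connection setup as well as query round-trip time.
    conn = pymysql.connect(host="mysql-host", user="probe",
                           password="secret", database="app")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    finally:
        conn.close()

def probe_redis():
    # Connect and issue a single PING.
    r = redis.Redis(host="redis-host", port=6379)
    r.ping()

def probe_memcached():
    # Hit the local moxi proxy and GET a key known to be absent, so the
    # request exercises the full membase path rather than a warm cache hit.
    mc = MemcacheClient(("127.0.0.1", 11211))
    mc.get("probe-key-known-missing")
    mc.close()

if __name__ == "__main__":
    probes = [("mysql", probe_mysql),
              ("redis", probe_redis),
              ("memcached", probe_memcached)]
    for name, probe in probes:
        print(f"{name}: {timed_ms(probe):.1f} ms")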
What I found was choppiness in the membase/memcached connect time, which corresponded to a spike in my global wait queue.
Here is a graph showing membase starting to do substantial disk fetches around the same time as this lag begins; the :23 mark seems to be the high-water point for both the delay and the disk fetches.
I am running the latest membase version - 1.6.5. Here is a link to diagnostic information I collected earlier about my cluster.
Has anyone else experienced this issue, and can anyone advise on a solution?