
Random spikes in membase cluster response time blocking all threads in nginx/passenger, causing requests to be dropped

Tue, 02/15/2011 - 12:37
esilverberg
Offline
Joined: 01/03/2011
Groups: None

 Folks,

(This is a new thread as the old thread title I created before is not relevant, and it's not clear to me that retitling the old thread will correctly bring over all the responses using this forum software)

I have a web app with 7 c1.mediums running 20 nginx/passenger processes apiece, each of which connects to a membase cluster of 3 membase m1.small instances. 

Periodically the global waiting queue spikes to 60 on every box, which typically means one of the three external dependencies my web app has (MySQL, Membase or Redis) is blocking or stalled. The end-user effect is that requests are dropped and the app appears non-responsive. Here is a graph to see what I mean.

I wrote some code to monitor the average latency experienced in each of my three components across all six of my front-end machines. I measure: the time to connect to MySQL and run one query; the time to connect to Redis and run a ping request; and the time to connect to memcached (moxi is running locally FYI) and run a single get request for an object that I know is not present in the cache.
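For anyone wanting to reproduce the memcached part of that measurement, here is a minimal sketch and not the code actually used in this thread. It assumes a memcached-compatible endpoint (such as a local moxi) speaking the text protocol on port 11211; the function names and the key `latency_probe_missing_key` are made up for illustration.

```python
import socket
import time


def measure_ms(probe):
    """Run probe() once and return the elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    probe()
    return (time.perf_counter() - start) * 1000.0


def memcached_miss_probe(host="127.0.0.1", port=11211,
                         key="latency_probe_missing_key", timeout=5.0):
    """Connect, send one text-protocol get for a key assumed absent, read the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"get " + key.encode("ascii") + b"\r\n")
        reply = b""
        # A miss is answered with just "END\r\n". Any other kind of response
        # (for example an error line) makes this loop wait until the socket
        # timeout fires, which raises an exception.
        while not reply.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before END")
            reply += chunk
    return reply
```

With something listening on the default port, `measure_ms(memcached_miss_probe)` gives the connect-plus-get time for one sample; averaging several samples per minute would give a series comparable to the graphs described here.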

Here is a graph that implicates membase as the root cause.

Choppiness in the membase/memcached connect time, which corresponded to a spike in my global wait queue. 

Here is a graph showing membase starting to do substantial disk fetches around the same time as this lag begins. The :23 mark seems to be the high-water mark for both delay and disk fetches.

I am running the latest membase version - 1.6.5. Here is a link to diagnostic information I collected earlier about my cluster. 

Has anyone else experienced this issue, and can anyone advise on a solution? 

Many thanks,

Eric

Tue, 02/15/2011 - 13:00
perry
Offline
Joined: 10/11/2010
Groups:

 Hey Eric, continuing our previous conversation...

 

It would appear to me that your analysis is all correct.  The main issue is that you're requesting some data from disk in Membase and the latency that you are getting is not acceptable to your application.  At a high level, the solution should be to give Membase more RAM so that it keeps all the data "resident".  Can you add a 4th server to the cluster and see if these problems go away?  Even if it's just temporarily for diagnostics, it will validate the problem.

 

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Tue, 02/15/2011 - 13:13
esilverberg
Offline
Joined: 01/03/2011
Groups: None

OK, I will do even better and add two additional servers, and report back. Thanks again for all your help Perry! 

Tue, 02/15/2011 - 21:10
esilverberg
Offline
Joined: 01/03/2011
Groups: None

I can now positively confirm that disk writes, caused when you have consumed approximately 75% of your available memory as per the docs, cause requests to membase to spike from ~10ms to ~4000ms during the disk write operation. This is what caused the random latency spikes and dropped requests in my web app.

Obviously a 400x latency spike like that is substantial, and I don't imagine many web apps would handle such behavior well, but I guess that is by design for now.

I have spun up a new cluster with two m1.large instances, increasing my available RAM by 4x, so hopefully that is large enough for my working set and I won't encounter disk writes causing this kind of latency anymore.
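As a rough way to sanity-check that kind of sizing, the sketch below compares a data set against the roughly 75% memory mark mentioned earlier in this post. The function name and every number in the example are hypothetical and not taken from this cluster; the 0.75 default simply mirrors the figure quoted above.

```python
def memory_headroom(node_quota_mb, node_count, data_mb, threshold=0.75):
    """Report how full the cluster's RAM quota is relative to a threshold."""
    total_mb = node_quota_mb * node_count
    used_fraction = data_mb / total_mb
    return {
        "total_mb": total_mb,
        "used_fraction": used_fraction,
        # True when the data set has reached the assumed threshold.
        "over_threshold": used_fraction >= threshold,
    }


if __name__ == "__main__":
    # Hypothetical figures: 2400 MB of data on 3 nodes with a 1000 MB quota each,
    # then the same data with four times the total quota.
    print(memory_headroom(1000, 3, 2400))
    print(memory_headroom(4000, 3, 2400))
```

This only checks total size against quota. It says nothing about which items are actually hot, so it is a first approximation of whether the working set fits in RAM.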

I can also say that when I attempted to add a new server to my existing 3-server m1.small cluster to increase the total available RAM, the rebalance operation took 2.5 hours, after which the entire cluster died, returning "SERVER_ERROR proxy write to downstream" errors. That took my whole app offline for 45 minutes while I had to spin up and warm an entirely new membase cluster. At this point, I will no longer be dynamically adding servers to my cluster, at least not with version 1.6.5.

I have since moved from using multiple m1.smalls to 2 m1.larges, and hopefully won't have to build a new cluster for a while.

Thanks for your help!

-Eric

Wed, 02/16/2011 - 11:39
perry
Offline
Joined: 10/11/2010
Groups:

 Glad to help Eric, thanks for the feedback and sorry you ran into some more problems.  From your other post it sounds like you ran out of disk space which can be quite a big problem.  We've got some task items to improve our behavior when this happens, but if you look at other solutions (MySQL for example) they don't provide you any better behavior and it's up to the administrator to make sure there is enough disk space.

 

Also, just to clarify, your increased latency is coming from disk READS not writes since a write is always done asynchronously.  If the application requests a piece of data that is no longer in RAM, Membase has no choice but to retrieve it from disk...
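To illustrate the distinction, here is a toy model and not Membase's actual implementation: writes return once the item is in RAM and queued for persistence, while a read of an item that is no longer resident has to wait for a (simulated) disk fetch. The class name, the capacity, and the 50 ms disk delay are all invented for the example.

```python
import collections
import time


class ToyStore:
    """Toy model only: a RAM cache in front of a slower disk."""

    def __init__(self, ram_capacity, disk_read_seconds=0.05):
        self.ram = collections.OrderedDict()    # resident items, oldest first
        self.disk = {}                          # persisted items
        self.write_queue = collections.deque()  # writes waiting to be persisted
        self.ram_capacity = ram_capacity
        self.disk_read_seconds = disk_read_seconds

    def set(self, key, value):
        # The write returns as soon as the item is in RAM; persistence is deferred.
        self.ram[key] = value
        self.ram.move_to_end(key)
        self.write_queue.append((key, value))

    def flush(self):
        # Stand-in for a background persister: drain the queue, then evict the
        # oldest (now persisted) items until RAM is back under capacity.
        while self.write_queue:
            key, value = self.write_queue.popleft()
            self.disk[key] = value
        while len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)

    def get(self, key):
        if key in self.ram:
            return self.ram[key]
        if key in self.disk:
            time.sleep(self.disk_read_seconds)  # the caller waits for the disk fetch
            value = self.disk[key]
            self.ram[key] = value
            return value
        return None


if __name__ == "__main__":
    store = ToyStore(ram_capacity=2)
    for k in ("a", "b", "c"):
        store.set(k, k.upper())
    store.flush()    # "a" is evicted from RAM but remains on disk
    store.get("c")   # resident: served from RAM
    store.get("a")   # not resident: pays the simulated disk read
```

In this model the only way to avoid the slow path is to keep the items being read resident, which is the reasoning behind adding RAM.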

 

Perry
