
Membase instance becomes unstable and unreachable after an hour

Mon, 07/11/2011 - 10:12
theburningmonk
Offline
Joined: 06/29/2010
Groups: None

Hi,

We just had some strange problems with our production cache cluster that we can't find an explanation for. Almost exactly an hour after deploying some changes to the cluster (going from a single-node deployment to a two-node cluster, and adding a membase bucket with 1X replication turned on), one of the servers (the new node) started behaving strangely:
- its CPU utilization shot up from around 5-8% to 50% and stayed there; logging onto the instance, we could see that all of that CPU utilization was attributed to the membase.exe process
- as far as we could tell, this node was not functioning: our code couldn't talk to it, yet it showed up as normal in the membase console
- there were quite a few error messages in the Log at exactly the same time (10.212.203.146 is the bad server), such as this message:
Control connection to memcached on 'ns_1@10.212.203.146' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,stats_recv,4},
  {mc_client_binary,stats,4},
  {ns_memcached,handle_call,3},
  {gen_server,handle_msg,5},
  {proc_lib,init_p_do_apply,3}]} (repeated 5 times)
- when we ran browse_log.bat we could see a lot of errors like this:
ns_1@10.212.203.146:ns_memcached:374: Unable to connect: {error,{badmatch,{error,system_limit}}}, retrying.

Some more context around the deployment:
- everything's deployed in Amazon EC2
- they're running on high CPU medium instances
- the get/store operations are all performed with cas
- average size of the objects stored in the membase bucket is around 44k and we had around 1.8k of these objects in the bucket at the time
- there were at most 300-400 ops/s across all the buckets (4 memcached, 1 membase) on the cluster
- disk space was not full, there was more than 17GB available

Since then, we have replaced the instances and rebuilt the cluster, and everything seems to work so far (it's been an hour and a half, so fingers crossed!), but we would really like to find out what went wrong the first time around and whether we're interacting with the cache incorrectly. Mind you, this is the first time we've come across this behaviour, having worked with multi-node deployments of NorthScale and various versions of Membase with different topologies and mixtures of membase and memcached buckets...

Any help would be much appreciated!

Thanks,

Wed, 07/13/2011 - 09:15
perry
Offline
Joined: 10/11/2010
Groups:

Can you send over a copy of logs from both nodes in the problem cluster? Those errors show that something had a problem connecting to the Membase process but it's unclear why at this point.

Also, I would probably recommend spending your AWS money on RAM instead of CPU in those instances.

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Thu, 07/14/2011 - 04:46
peterlemonjello
Offline
Joined: 07/14/2011
Groups: None

I'm having similar issues while testing. A default Memcached bucket works fine, but a Membase bucket fails sporadically.

Mon, 07/25/2011 - 02:02
theburningmonk
Offline
Joined: 06/29/2010
Groups: None

Hi Perry,

After the incident we restarted the servers with replication and persistence turned off, so we don't have the log files anymore. We plan to try to reproduce the issue in our load testing environment; if we manage to, I'll send you the logs for analysis.

So in general, do you recommend using high-memory instances instead of high-CPU instances? I ran some tests before comparing the small and medium (high-CPU) AWS instances: the medium instance was able to handle twice the number of ops/s as the small instance, so our feeling was that CPU helps, hence our decision to go with medium instances.

At the time of the incident we were only running at about 10-15% memory allocation so I didn't expect the problem to be a lack of memory in this particular case.

Regards,

Mon, 07/25/2011 - 15:15
perry
Offline
Joined: 10/11/2010
Groups:

In general we recommend more memory rather than less. CPU will be a determining factor at some point, but I doubt that is the case here. I think it's more likely to be network saturation, and certain AWS instances have more network bandwidth than others... we'd have to check with Amazon to be sure.

Perry
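
As a rough check on the network-saturation idea, the figures from the first post (objects of around 44k, at most 300-400 ops/s, one replica) can be turned into a worst-case traffic estimate. This is only a sketch under stated assumptions: it treats "44k" as 44 KiB, pretends every operation moves one full object of that size (the thread does not say how the ops were split across the five buckets or how large the memcached objects were), and uses hypothetical read/write mixes. The function and constant names are illustrative, not part of any Membase tooling.

```python
# Back-of-envelope estimate; inputs come from the first post or are labeled assumptions.
OBJECT_SIZE_BYTES = 44 * 1024   # assumption: "44k" means 44 KiB per object
OPS_PER_SECOND = 400            # upper figure quoted for the whole cluster
REPLICAS = 1                    # the membase bucket had one replica configured


def worst_case_mbit_per_s(object_size_bytes, ops_per_second, write_fraction, replicas):
    """Upper bound on traffic if every operation moved one full object.

    Writes are counted once more for each replica copy sent to another node.
    """
    client_bytes = object_size_bytes * ops_per_second
    replica_bytes = object_size_bytes * ops_per_second * write_fraction * replicas
    return (client_bytes + replica_bytes) * 8 / 1_000_000


if __name__ == "__main__":
    for write_fraction in (0.0, 0.5, 1.0):  # hypothetical read/write mixes
        mbit = worst_case_mbit_per_s(
            OBJECT_SIZE_BYTES, OPS_PER_SECOND, write_fraction, REPLICAS
        )
        print(f"write fraction {write_fraction:.1f}: about {mbit:.0f} Mbit/s")
```

Under these assumptions the ceiling works out to roughly 144 to 288 Mbit/s across the cluster. The real figure was probably lower, since only one of the five buckets held the 44k objects, so the number is mainly useful for comparing against whatever bandwidth Amazon reports for the instance type.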
