Single Node Failure Scenario

Thu, 07/21/2011 - 13:29
pkelly
Offline
Joined: 02/24/2011
Groups: None

We've got a new datacenter with about 20 nodes in it (all 1.7.0, btw), and it's been sustaining a moderate level of traffic for a while. We've had less luck with the hardware. In this case, the RAID controller failed, rendering the disk read-only.

The membase instance on this node is unhappy, needless to say, as its write queues fill up, writes time out, etc. Our ops team hit the Fail Over button, but upon seeing the data loss warning, realized that the node is still up -- it's just not able to write to the disk. We don't want to lose that data, especially since it's still accessible in memory.

They tried Remove, but of course that failed, because the node needs to be healthy to participate in the rebalance.

In the datacenter, when this happens, we see a cascading failure of the entire cluster:

* First, we see "I'm not responsible for this vbucket" errors from requests that I presume would route to the failed node.
* In the membase logs, we see failures from nodes that replicate to the failed node.
* Eventually (22 minutes later in the logs I'm looking at), we start to see timeouts from all membase activity across the board.
* At that point, the datacenter fails over
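
For context on the first symptom: Membase clients hash each key to a vbucket and look that vbucket up in a vbucket-to-server map, and a node rejects requests for vbuckets it does not own. Below is a minimal Python sketch of that idea, not the actual client code. The CRC32-modulo hash, the round-robin map, and the `Node`, `build_map`, and `route_and_store` names are all simplifying assumptions made for illustration.

```python
import zlib


def vbucket_for_key(key: str, num_vbuckets: int) -> int:
    """Hash a key to a vbucket id (simplified: CRC32 modulo the vbucket count)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


class Node:
    """Hypothetical server node that only accepts writes for vbuckets it owns."""

    def __init__(self, name, owned):
        self.name = name
        self.owned = set(owned)
        self.data = {}

    def store(self, vbucket, key, value):
        if vbucket not in self.owned:
            return "NOT_MY_VBUCKET"
        self.data[key] = value
        return "OK"


def build_map(node_names, num_vbuckets):
    """Assign vbuckets to nodes round-robin; the list index is the vbucket id."""
    return [node_names[v % len(node_names)] for v in range(num_vbuckets)]


def route_and_store(key, value, client_map, nodes):
    """Send a write to whichever node the client's (possibly stale) map names."""
    vbucket = vbucket_for_key(key, len(client_map))
    return nodes[client_map[vbucket]].store(vbucket, key, value)


if __name__ == "__main__":
    old_map = build_map(["a", "b", "c"], 6)  # what the client still believes
    new_map = build_map(["a", "c"], 6)       # ownership after "b" is removed
    nodes = {
        name: Node(name, [v for v, owner in enumerate(new_map) if owner == name])
        for name in ("a", "c")
    }
    for key in ("user:1", "user:2", "user:3", "user:4"):
        vbucket = vbucket_for_key(key, 6)
        if old_map[vbucket] == "b":
            print(key, "-> node b is gone, request would time out")
        else:
            print(key, "->", route_and_store(key, "v", old_map, nodes))
```

With a stale map, a client keeps sending some keys to a node that no longer owns their vbucket, which is one way this error can show up after the cluster map changes.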

For the immediate membase problem, Fail Over is the right answer, but the whole scenario leads to a couple of other questions:

1. Each of these nodes has 56 GB of RAM, of which only a fraction is used. We could fill queues for a long time before exhausting the memory. Let's say we catch the situation after spooling up 10GB of data that should live in the local vbuckets. If we hit Failover, can we expect a large percentage of that data to have propagated to the replicas, or are we going to lose all 10GB? Along the same lines, what's the sequence of events involved in the replication?

2. The cascade failure is a little unexpected. How exactly would the failure of one node back up the rest of the nodes in the cluster?

3. It would be great (for my pathological case, and without thinking about complications) if the failing node would simply keep running, reading from the disk but failing to write to it. We'd manually remove it from the cluster, and lose no data. What prevents it from operating like that in this case?

Thanks,

Paul

Thu, 07/21/2011 - 14:03
perry
Offline
Joined: 10/11/2010
Groups:

Hey Paul, sorry you're having troubles. Thanks for the detailed description though.

Let me try to address your questions and then we can go from there:
1 - As long as the network between nodes is good, you can be pretty sure that replication is happening properly. If you look in the GUI, under the "TAP" section of statistics, you should be able to see how many outstanding items are in the replication queue for that particular node. If it's very low, you're probably in good shape to fail it over even if it's still live. Replication and disk writing use separate queues and processes, so they happen independently of one another. Can you clarify what you mean by the "sequence of replication"?

2 - I would agree this is fairly unexpected. Can we get some logs (collect_info) from both the failing node and the ones that "shouldn't" be affected? There could also be something higher up the stack (i.e. the client library) that is getting messed up when it doesn't get answers in a timely fashion for some of the data.

3 - I would also think that this should work by design...but I'm thinking that our inability to create the journal file entries for writing to sqlite is also somehow affecting our ability to get the data out. I believe 1.7.1 will help greatly here if all the data is in RAM.

Thanks again, let us know how else we can help.

Considering the size of your deployment, and obvious reliance upon Membase, I'd also like to make a shameless plug for an Enterprise license. Having someone to call up and walk through all the details here can be invaluable to making this a successful experience for you.

Perry

__________________

Forum support is great for free but sometimes you need a guaranteed response time and dedicated resources for your questions or issues.
Consider purchasing enterprise-level support from Couchbase: http://www.couchbase.com/products-and-services/overview
Call or email "sales -at- couchbase-dot- com" today!

Thu, 07/21/2011 - 19:10
pkelly
Offline
Joined: 02/24/2011
Groups: None

Hi Perry, thanks for the reply.

1. By "sequence of events", I really meant something along the lines of "do replication and persistence events for a particular piece of data happen serially or in parallel". From your response, it sounds like they happen in parallel, which is good -- at least up until the point where the whole cluster goes out.

2. There's a hitch in that -- with the failure of the RAID controller (which renders the disks read-only), there's no data in the log on the failing machine starting at the point where the controller goes out. Logging starts up again after we restart the machine (up to the point where the controller fails again), but by then we've missed all the fun. That being said, I'll forward both of those plus a client log.

The client (spymemcached in this case) starts timing out as soon as the first failure is noticed, and a few minutes later we start seeing these -- it looks to me like the vbucket-to-server mapping is suddenly messed up:

[Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11213}] ERROR net.spy.memcached.protocol.binary.StoreOperationImpl - Error:  I'm not responsible for this vbucket
2011-07-21 14:50:43,175 [ObjectStoreWriter-1] ERROR com.mybuys.platform.persist.MembaseObjectStore - Error storing key 21123454353
java.util.concurrent.ExecutionException: OperationException: SERVER: I'm not responsible for this vbucket
	at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:72)
	at com.mybuys.platform.persist.MembaseObjectStore.store(MembaseObjectStore.java:221)
	at com.mybuys.platform.persist.FrontEndObjectStoreImpl.put(FrontEndObjectStoreImpl.java:346)
	at com.mybuys.platform.persist.FrontEndObjectStoreImpl$OperationBackground.run(FrontEndObjectStoreImpl.java:459)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: OperationException: SERVER: I'm not responsible for this vbucket
	at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:132)
	at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:148)
	at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:134)
	at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:392)
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:324)
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:199)
	at net.spy.memcached.MemcachedClient.run(MemcachedClient.java:1622)

Also, in looking through the client log, I notice that even the data going to/from memcache-type buckets are timing out.  It's possible that they're stuck behind other requests, I suppose, but were that the case, I'd be interested to understand why they're able to "back up".

3. Sounds like 1.7.1 has some significant advantages in terms of automatic failover, etc.  Is there something in particular that will allow it to run in "diskless" mode?

One other thing I'm interested in determining as a result of all this is how to get an early warning that this is happening, using Nagios or some such. For example, I'd expect that, even though we're not writing disk stats, the REST interface should still be working, and we ought to see the disk write queue growing out of control. Perhaps that, combined with the replication queue draining more or less normally, could indicate such an event. Would that make sense?
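
A rough sketch of that heuristic in Python, assuming the two queue lengths are already being polled somehow (for example from the REST interface). The `QueueSample` fields, the function name, and both thresholds are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueueSample:
    disk_write_queue: int   # items waiting to be persisted (hypothetical field)
    replication_queue: int  # items waiting to be replicated (hypothetical field)


def disk_stall_suspected(samples: Sequence[QueueSample],
                         min_growth: int = 10_000,
                         max_replication_backlog: int = 1_000) -> bool:
    """Flag a node whose disk queue only grows while replication keeps up.

    `samples` must be ordered oldest to newest. Thresholds are assumptions.
    """
    if len(samples) < 2:
        return False  # not enough history to see a trend
    disk = [s.disk_write_queue for s in samples]
    # The disk queue never shrinks between consecutive polls...
    never_drains = all(later >= earlier for earlier, later in zip(disk, disk[1:]))
    # ...and it has grown by a meaningful amount over the window...
    grew_enough = disk[-1] - disk[0] >= min_growth
    # ...while the replication queue stays small the whole time.
    replication_ok = all(
        s.replication_queue <= max_replication_backlog for s in samples
    )
    return never_drains and grew_enough and replication_ok


if __name__ == "__main__":
    window = [QueueSample(5_000, 120), QueueSample(40_000, 80), QueueSample(90_000, 150)]
    print(disk_stall_suspected(window))
```

Requiring the queue to never drain across the window is a deliberately strict choice, so that an ordinary write burst that clears on its own does not raise an alert.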

Thanks for the help,

Paul

Fri, 07/22/2011 - 14:26
perry
Offline
Joined: 10/11/2010
Groups:

Thanks for the details Paul, shame about the log...makes sense though unfortunately.

I'll ask one of our SDK engineers to take a look at the spy issues (you know we wrote it, right?). What version are you running?

There's nothing special about a "diskless" mode in 1.7.1, just general stability improvements, especially around rebalancing.

There is actually a Nagios plugin available: http://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check.... I believe it's slightly outdated for 1.7, but should be very easy to augment. In addition to the general disk queue stats, you'll probably want to focus on the "ep_commit_failed" stat, which I'm fairly sure would be high on that server. This indicates a failure writing data to disk. A few transient ones aren't too bad (we retry) but any growing number here is your first indication of a problem with the disk.
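
As a sketch of how such a check might look, here is a small Python function that compares two successive readings of the commit-failure counter and maps the change onto the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the thresholds are assumptions; fetching the counter from the server is left out.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_commit_failures(previous: Optional[int], current: int,
                          warn_delta: int = 1,
                          crit_delta: int = 100) -> Tuple[int, str]:
    """Compare two readings of the counter and return (exit_code, message).

    `previous` is the reading from the last poll, or None on the first run.
    Both thresholds are hypothetical defaults, not recommended values.
    """
    if previous is None:
        return OK, f"OK - baseline recorded ({current} commit failures so far)"
    if current < previous:
        # A counter that shrinks usually means it was reset in between polls.
        return UNKNOWN, "UNKNOWN - counter went backwards (was the node restarted?)"
    delta = current - previous
    if delta >= crit_delta:
        return CRITICAL, f"CRITICAL - {delta} new commit failures since last check"
    if delta >= warn_delta:
        return WARNING, f"WARNING - {delta} new commit failures since last check"
    return OK, "OK - no new commit failures"


if __name__ == "__main__":
    code, message = check_commit_failures(previous=12, current=15)
    print(code, message)
```

Since a few transient failures are retried, `warn_delta` could reasonably be raised above 1 to avoid noise; a real plugin would also exit with the returned code.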

Hope that helps.

Perry

Fri, 07/22/2011 - 15:32
pkelly
Offline
Joined: 02/24/2011
Groups: None

Indeed. We're on version 1.6 of spymemcached, last entry in changelog.txt is:

commit 740a9c63e887623c678049b7f1003e2f5372dcb8
Author: Dustin Sallings <dustin@spy.net>
Date:   Wed May 11 12:05:04 2011 -0700
 
    Removed a bit of dead test code.
 
    Change-Id: I295d7b6b301217866f1074c526cdeba6d60420ab
    Reviewed-on: http://review.membase.org/6152
    Tested-by: Matt Ingenthron <matt@northscale.com>
    Reviewed-by: Matt Ingenthron <matt@northscale.com>

Fri, 07/22/2011 - 15:55
pkelly
Offline
Joined: 02/24/2011
Groups: None

That's "ep_item_commit_failed", right?

Since it's a top-level stat, and I have multiple buckets, each on its own port (11211 - 11215), if I run

echo stats | nc <server> 11211 | grep "ep_item_commit_failed"

is that going to give me the aggregate number of failures (across all buckets) for that node only?
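
For anyone scripting this: the memcached text protocol returns `stats` results as `STAT <name> <value>` lines followed by `END`, so pulling one counter out is straightforward. A minimal Python sketch follows; the sample text and its values are made up for illustration, and the helper names are hypothetical.

```python
from typing import Dict, Optional


def parse_stats(raw: str) -> Dict[str, str]:
    """Parse memcached text-protocol `stats` output into a name -> value dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        # Stat lines look like "STAT <name> <value>"; "END" and blanks are skipped.
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def commit_failures(raw: str) -> Optional[int]:
    """Return ep_item_commit_failed as an int, or None if the stat is absent."""
    value = parse_stats(raw).get("ep_item_commit_failed")
    return int(value) if value is not None else None


# Made-up sample output, for illustration only.
SAMPLE = "STAT curr_items 1000\r\nSTAT ep_item_commit_failed 42\r\nEND\r\n"

if __name__ == "__main__":
    print(commit_failures(SAMPLE))
```

Returning `None` for a missing stat keeps "the counter is zero" distinct from "this endpoint does not report the counter", which matters when deciding whether to alert.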

Thanks,

Paul

Mon, 07/25/2011 - 14:05
perry
Offline
Joined: 10/11/2010
Groups:

No, that command above will give you the aggregate number of failures for just the one bucket across all nodes (because you're sending the request through a Moxi process).

There's currently no way to get the stat across all buckets for a single node, but you can use this command to get the stat per-bucket, per-node:

/opt/membase/bin/mbstats :11210 all | grep "ep_item_commit_failed"
