What happens when a node in the cluster goes down?

Scenario:
One cluster with two nodes running a bucket named test. The bucket type is Couchbase. The bucket is set up to have one replica.

I upload 10K documents to the bucket. The cluster now divides the data across the nodes. Node 1: 5K active / 5K replica Node 2: 5K active / 5K replica.
If you remove one node and rebalance the replica of the still active node moves all documents from replica to active. Now the active node has 10K active and 0K replica. This is perfect behavior.

Now lets say I have both my nodes working with node 1 having 5K active / 5K replica and node 2 having 5K active / 5K replica).
Clients are now polling data from the cluster at lets say a rate of 2K gets per second. Then suddenly node 2 goes down. Now when the clients are polling data they will only get a result when the document is part of the 5K active on node 1. I would expect that the clients should get their data but with a little delay since it now needs to also get data from node one's replica documents.

What happens now (at least when using the C# API) is that when one node goes down half of the data is unavailable to the clients until a rebalance is done. If you have lets say 1M documents you will have about 500K documents unavailable to the clients until rebalance is done.

Shouldn't the cluster always return the same result as long as the data exist in active or replica?

1 Answer

« Back to question.

Instead of a rebalance when a node goes down, what you really want to do first is failover. Rebalance will accomplish the same thing in the end, but failover will be much quicker.

When a node goes down, it's portion of the dataset becomes unavailable. All the other nodes will continue serving their data. At some point, there needs to be an explicit trigger to the rest of the cluster to make the replica vbuckets active for those that went down. This trigger can be either manual (by clicking the button), automatic (by using our auto-failover feature) or programmatic (by an external script triggering it). Manual will take as long as it takes for an admin to get to, automatic will take a minimum of 30 seconds (to ensure that the node is actually down) and programmatic can happen as quickly as an external system determines that the node is down. Once triggered, the failover is immediate. http://www.couchbase.org/wiki/display/membase/Failover+Best+Practices

We don't currently allow reading from replicas in order to enforce the very strong consistency that memcached/Membase/Couchbase provide.

Hope that helps explain things.

Perry

Thanks for a good answer.

I think it would be great if the API could be extended so when a query is ran against the cluster and a node is down the API would return that one node was down. We should also extend the API to have the possibility to query replica of a specific node. If we had those two functions we could extend the client libraries to have an option that says re-query when one or more nodes is down. If this option is set to true and the cluster return that node x is down the the client library would rerun the query but now only on the replica for node x. Then the library could merge the result and return it.

The best solution would of course be that the cluster had this function built in but that may be more complex to do.

Svein Helge

That's a fair request, but we don't currently allow access at all to replica vbuckets. So even if you could query them, they wouldn't respond.

Also, if at some point your application decides that a node is down and it should read from replicas...why not just fail it over?

Perry

So far I haven't found a way to get if a node was down returned when running a query against the cluster(I'm using the C# client).

I personally don't like the idea of having the client deciding when to do a fail over. I think that is a job for the cluster. If the client find a node down because of a network failure lasting only 1 second, I don't want it to do a fail over. The cluster already have functionality to fail over server after x number of seconds.

I really hope that we in the future get the possibility to query replicas. For me it's very important that if one node goes down the front end client should not be affected.

Let's say you have 5 nodes with 500M documents. Then one node goes down. The the front end client now have 100M documents unavailable for 30 seconds. If you have let's say 10K query every second you will have at average 2.5K queries every second returning with an incomplete result for 30 seconds. That would mean 75K queries returned with incomplete data. I don't think that is very good.

This problem could be resolved if the cluster returned which nodes was down during a query and added the possibility to query replica. Then we could just upgrade the client libraries to query the replica when a node is down.

I think this would be a great improvement. This would also make the cluster more robust. If someone take a server down without calling fail over it wouldn't matter. The client would still get the correct result. This would also allowing the use of more cheap hardware since the cluster would compensate for failing hardware.

Don't take any of this as any form of criticism because i think Couchbase Server is a brilliant product. These are just my suggestion to how we can get the cluster to have as close to 100% uptime seen from the client side. Today the clients will have data unavailable every time a node goes down uncontrolled.

Thanks,
Svein Helge

Thanks Svein, I'm happy to take your suggestions and glad to hear you like so highly of Couchbase Server.

The real big issue here is consistency...

If a single client loses some network connectivity to one node, and decides to go get the data from another one...it's very hard to ensure that this client sees the same data as other clients (who may have not lost network connectivity). In some cases, it would be okay...in most cases, not.

By requiring an explicit failover, we ensure that ALL clients (no matter their connectivity) will have a consistent view of the cluster.

It's still a good suggestion, we've definitely been considering ways of doing it and I will make sure your feedback gets back to our product management team.

Perry

In terms of consistency you have that problem already today.

Lets say I have several clients running a total 2000 queries every second towards the cluster over a period of 1 minute. The data inside the cluster is not updated during this period. If one of the nodes is down for 1 second the clients will not get have consistent result during this period. Because some of the clients will not get data from the node that was down for a second.

So the question should be more what is best:

Today when we have 3 nodes and one goes down, we will have 33% of the data unavailable until the fail over is triggered.

The alternative would be that during the period between when a node goes down and the fail over i ran some clients may in some cases get different results.

I personally thinks it's better that maybe some times you can get different results rather then having 33% of data unavailable. Wouldn't that be close to what we call with nolock in SQL servers.

We could put this as an option and call it read optimistic or something. This way default behavior is like today but you have an option to ask the cluster "optimistically" when a node is down.

The main goal is that when a node goes down uncontrolled or losses network connection for a second all data should still be available to the clients. This way the cluster would be much more robust and be more like "crash only" design for nodes.

I really hope your team will find a way to have all data available during the period when a node goes down to the fail over is ran.

Thanks
Svein Helge

I think we're on the same page, and making it an extra feature is on our horizon of things to do.

However, I disagree with you about the current state of things. We're more concerned with writes than reads, since that's where things can get out of sync. With the current design, there is zero possibility of getting inconsistent data...though you have to be okay with possibly not getting data at all (read about the CAP theorem...while it doesn't totally apply to distributed databases, we are CP in that we are consistent but not always available in light of a network partition)

With Membase, you can either access the authoritative copy of the data, or you cannot access it at all. Other systems will allow you to continue accessing stale data...it's simply a matter of design, we chose very explicitly one behavior.

I don't disagree that this would be a useful feature, but it would definitely break the strong consistency that we provide today.

Perry

I agree that we are one the same page.

I don't see why reading from replica can cause writing issues.
Since a fail over is instant is there a way we could simulate a fail over for a client reading data. We are only speaking of reading data.

Writing always returns error when something goes wrong but reading don't return any error just return nothing or part of the correct result set. So there is no way the client can know that anything is wrong. (C# client library)

I really hope you can figure a way so the client reads are not affected every time a node goes down or the network is unstable.

You're right, it doesn't cause writing issues...but if client A has a different view of the system than client B, you might write something from A and have B read from a replica...now you've got the potential for inconsistency.

What do you mean that reading doesn't return an error? It most certainly does (if you're looking for it) either as a timeout or an actual error message.

We're actually working on some things right now to enable replica reads...however, an unstable network is still going to be a problem if your clients can't actually reach the Membase nodes. One of our major benefits is that each node is still able to serve its own data even if the network between them is down.