In order to properly write a program against a system, especially a complex system such as a distributed database, it is necessary to understand how that system behaves in situations that are outside normal operation. This document will try to describe the different situations and what application code can expect to need to handle in those situations.
|This document is still under development and review. Comments and questions are greatly appreciated.|
During a node addition, all data operations and query operations will continue to operate normally. Application code should not be able to discern (other than effects to resource utilization) that anything is happening.
When a node is removed, all data operations and query operations will continue to operate normally. Again application code should not be able to discern (other than effects to resource utilization) that anything is happening.
|There is a known issue with this behavior when using server releases prior to 2.0. With 1.6, 1.7 and 1.8 an application will see errors as the end of the rebalance completes. These will not be retried by the client. This condition is also, unfortunately, not distinguishable from a node failure.
The application code may retry this operation shortly after the rebalance completes and it will most likely be handled normally.
Application developers are advised to consider what is appropriate for their application in this circumstance.
The issue related to this behavior is MB-5406.
In this scenario, data is migrating from the node being removed to the node being added. Frequently this is referred to as a swap rebalance.
All data operations and query operations will continue to operate normally. Again application code should not be able to discern (other than effects to resource utilization) that anything is happening.
|Application code may see the same errors described in the section above about removing a node due to the same underlying issue. See the note in that section for recommendations.|
The exact behavior depends on whether the operation is related to a data update or modification, or a query.
If the the requested operation is a data operation, such as a retrieving a document with a get() or storing a document with a set(), the success or failure of the operation would be based on which node it is supposed to be for. In Couchbase Server, this is determined by a hashing algorithm driven by the key/id for the given document. For example, "foo" could be mapped to the hostname node1.example.com and "bar" could be mapped to node2.example.com. In the event the server for node1 fails, any attempt to modify the key/id "foo" will eventually fail. It will not necessarily immediately fail, as the client library will continue to try to reconnect to the node. As long as the operation requested is still valid and has not reached the timeout threshold, the client will continue to try to service the request.
If the requested operation is a query, the request will be routed to an available node. The JSON response from the cluster will contain an errors array, and the client may indicate an error condition. Through modification of the on_error parameter to the query, the client may request different behavior from the server. Note that this means a client may get query results which are incomplete and it is the responsibility of the application to handle these error conditions. See the Couchbase Server documentation for more info.
When all nodes fail, no operations will be available. If the client had started and been working with a cluster previously, it would continue to try to connect to that cluster until it becomes available or the client is terminated.
If a newly instantiated client tries to connect to the cluster, it will fail to initialize.
When the Bootstrap Node for a Client Fails
If the bootstrap node, which is always the first URI in the list of URIs supplied to the client, happens to fail, the client will walk the list of URIs until it finds a system which can service it's requests.
This is currently implemented two different ways.
In some clients (Java), the persistent failure of operations over a period of time is treated as a reason to distrust the current operating configuration. This will make the client attempt to re-bootstrap.
In other clients (.NET), the client will make a small request of all nodes of the cluster on a regular basis to verify connectivity. If any node is seen as not available, the client will attempt to re-bootstrap.
Starting a Client When the Cluster is Unreachable
The client will fail to initialize. The application should handle this condition appropriately.