Hi, we are using Couchbase Java client 1.4.5, and we find that FailureMode.Redistribute does not work as expected.
As the API documentation says:
Move on to functional nodes when nodes fail.
In this failure mode, the failure of a node will cause its current queue and future requests to move to the next logical node in the cluster for a given key.
However, when we manually drop all requests on a Couchbase Server node, the client still tries to push some documents to that failed node instead of redistributing them to other nodes.
So here is my question:
1. How does the client determine which node has failed? Do we need to manually remove the failed node from the node list in the client?
2. If it detects the failed node, why does it still try to connect to the same node?
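For reference, this is roughly how we enable the failure mode when building the client; hosts, bucket name, and password below are placeholders, and the builder call is based on the 1.4.x `CouchbaseConnectionFactoryBuilder` API:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.CouchbaseConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;

public class RedistributeSetup {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster bootstrap list.
        List<URI> nodes = Arrays.asList(URI.create("http://node1:8091/pools"));

        CouchbaseConnectionFactoryBuilder builder = new CouchbaseConnectionFactoryBuilder();
        // Ask the client to move queued and future operations to the next
        // logical node when a node fails.
        builder.setFailureMode(FailureMode.Redistribute);

        CouchbaseClient client = new CouchbaseClient(
            builder.buildCouchbaseConnection(nodes, "default", ""));

        // ... perform operations ...

        client.shutdown();
    }
}
```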
If you use FailureMode.Redistribute, the Java client will requeue the operation until the master node for the partition comes back online or is failed over to a different node. So we are trying to achieve best-effort delivery up to the point where the operation times out.
If you are using FailureMode.Cancel, what happens is that instead of requeuing the operation, we just cancel it right away, so you get fail-fast behavior.
Note that the difference from regular memcached buckets here is that there is only ever one specific node where we can find the data, so we need to wait until it is accessible again or cancel the operation.
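For illustration, with FailureMode.Cancel the cancellation surfaces on the returned future. A hedged sketch, assuming an already-connected `CouchbaseClient` named `client` (the exact exception wrapping may differ between client versions):

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;

import net.spy.memcached.internal.OperationFuture;

// ... inside a method with a connected CouchbaseClient `client` ...
OperationFuture<Boolean> future = client.set("mykey", 0, "value");
try {
    if (!future.get()) {
        // The operation completed but was not stored; inspect the status.
        System.err.println("set not stored: " + future.getStatus().getMessage());
    }
} catch (ExecutionException e) {
    if (e.getCause() instanceof CancellationException) {
        // Fail-fast path: the target node was unreachable and the
        // operation was cancelled instead of being requeued.
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
```

With Redistribute, by contrast, the same call would typically block until the operation succeeds or times out.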
Thank you so much for your reply, really helpful.
I understand the request will wait for the master node to be failed over. However, what if the master node is fine in the cluster but the client has a connection issue with that node? In that case redistribution will always fail, since the request always goes to the same node.
So my question is whether there is a way to rebuild the hash so the key goes to another node instead of the disconnected one?
From the client's point of view it doesn't matter whether you have a connection issue or the node is really down: it can't access the node, so it's not able to read or write data for that partition.
Keep in mind that Couchbase wants to make sure your data is safe, so we cannot just write to another node. There is only ever one node that accepts writes for a given partition. You can read from replicas, though, if you are clear about the tradeoffs (data consistency).
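If stale reads are acceptable, a sketch of falling back to a replica read with the 1.4.x client; this assumes the bucket has replicas configured and a connected `CouchbaseClient` named `client`, and `getFromReplica` may return data that is behind the active copy:

```java
// Try the active node first.
Object value = client.get("mykey");
if (value == null) {
    // Fall back to a replica copy. Tradeoff: the replica may not yet
    // have the latest write (eventual consistency on replicas).
    value = client.getFromReplica("mykey");
}
```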
The client will try to reconnect the socket with an increasing interval and will keep trying until you shut it down or it gets a new configuration indicating that the target node is gone (for example, because you initiated a manual failover or automatic failover was triggered).
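The reconnect schedule itself is internal to the client, but the "increasing interval" idea can be sketched as a simple capped exponential backoff; the values below are illustrative, not the client's actual timings:

```java
public class ReconnectBackoff {

    // Illustrative only: delay doubles per attempt, capped at capSeconds.
    // This mirrors the spirit of an increasing reconnect interval, not the
    // client's real internal schedule.
    static long delaySeconds(int attempt, long capSeconds) {
        long delay = 1L << Math.min(attempt, 62);
        return Math.min(delay, capSeconds);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                + delaySeconds(attempt, 30) + "s");
        }
    }
}
```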