Node failure blocks Java client

Using Couchbase 3.0.2 Enterprise on Debian v7 and Java client 2.1.1.
I created a simple test scenario in which I continuously update one document. During the test I kill one of the 5 nodes of the cluster (precisely the node where the master copy of the document lives) and I get a java.util.concurrent.TimeoutException. So far this is expected, but when I then try to read the document from its replicas (there are two replicas on other, working nodes), I still get TimeoutExceptions for those calls too.

I suppose querying the first or second replica should work without any delay.

Please help me figure out whether this is a bug in my app or in Couchbase, or whether some other configuration is needed to address this issue.

Code: https://gist.github.com/anonymous/c663676de78a6fe14797
Log: https://gist.github.com/anonymous/a101e7f5df903c4a54db
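
For readers who don’t open the gists, the test is roughly of the following shape (a minimal sketch only, not the exact gist code; node names, bucket name and document id are illustrative):

    import java.util.List;
    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.ReplicaMode;
    import com.couchbase.client.java.document.JsonDocument;
    import com.couchbase.client.java.document.json.JsonObject;

    public class FailoverTest {
        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = CouchbaseCluster.create("node1", "node2", "node3", "node4", "node5");
            Bucket bucket = cluster.openBucket("default");
            String id = "test-doc";

            int counter = 0;
            while (true) {
                try {
                    // Continuously update the single test document.
                    bucket.upsert(JsonDocument.create(id, JsonObject.create().put("counter", ++counter)));
                } catch (RuntimeException updateFailure) {
                    // Once the node holding the active copy is powered off, the
                    // blocking API throws a RuntimeException wrapping
                    // java.util.concurrent.TimeoutException.
                    System.err.println("update failed: " + updateFailure);
                    try {
                        // Fallback read from a replica; this is the call that
                        // also times out in the report above.
                        List<JsonDocument> copies = bucket.getFromReplica(id, ReplicaMode.FIRST);
                        System.out.println("replica read returned " + copies.size() + " copies");
                    } catch (RuntimeException replicaFailure) {
                        System.err.println("replica read failed too: " + replicaFailure);
                    }
                }
                Thread.sleep(100);
            }
        }
    }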


hi @sini_ustream,
I’ll try to reproduce it, thanks for sharing the code and DEBUG logs.

what’s the timeline of your test? you identify the node on which the doc is active in normal operation, then what? do you gracefully failover the node in the console? hard failover? ssh to the node and stop the service?
do you do a rebalance at some point?
thanks for any details you can provide :wink:

Hi,

As there is only one document in the whole cluster, I can easily see on the admin console where its master copy is. So after I started my test application, I killed that node. As the nodes are virtual machines, I simply powered them off, so there was no graceful failover or shutdown, just as it happens in real life with real servers :smile:. After a few seconds (I guess no more than 3-5 seconds), I get the TimeoutExceptions.

Peter

So you don’t do a rebalance? Here is what I see when I hard failover the active node and don’t rebalance (consistent with what you saw):

  • the active copy changes to be one of the replica nodes (in my test, it was node5 and is now node3, and cbc-hash command line utility confirms that)
  • there’s only one replica left (same here, cbc-hash walter -h node1 shows that the second replica is “N/A”)
  • reading from ReplicaMode.SECOND fails with a timeout (because there are only 2 copies in the whole cluster at this point; the 3rd is on the downed node)
  • attempting to upsert with ReplicateTo.TWO fails (because without rebalancing, the identified second replica for the key is not available; both failing cases are sketched below)
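
To make this concrete, here is roughly what the client code runs into during that window (a fragment; imports as in the test code linked above, plus com.couchbase.client.java.ReplicateTo and com.couchbase.client.java.error.DurabilityException):

    // Between hard failover and rebalance: the plain get works, because the
    // remaining replica has been promoted to active.
    JsonDocument doc = bucket.get(id);

    // Explicitly targeting the second replica times out, since that copy
    // lived on the downed node.
    try {
        bucket.getFromReplica(id, ReplicaMode.SECOND);
    } catch (RuntimeException e) {
        // wraps java.util.concurrent.TimeoutException
    }

    // Requiring two replica copies on a write cannot be satisfied either,
    // until a rebalance recreates the second replica.
    try {
        bucket.upsert(doc, ReplicateTo.TWO);
    } catch (DurabilityException e) {
        // only one replica is reachable during this window
    }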

After rebalancing, everything goes back to normal: a new node is elected as the second replica, and operations with a replication factor of two work again.

Yes, after rebalancing it works fine. But in my opinion the application should still be able to work until the rebalance is executed. The rebalance is manual, and the failover can be automatic after 30 seconds at the earliest, which means the application won’t be available for at least half a minute. That is waaaay too much for me.

What I do not understand here is that why the replicas are not available. When I call

bucket.getFromReplica(TEST_DOC_NAME, ReplicaMode.FIRST)

the server storing that replica is up and running, and its location should be known to the client or to the cluster, so I do not see why it cannot be returned.

The update path is a different matter: there I would like the cluster to detect within 1-2 seconds that a node is down and react quickly, so that the value can be updated with only a small delay.

@sini_u

I have the same problem as you.
Check this article

As far as I can see, replica reads are implemented, at least in the Java client; that is the call I use in my example. But it does not work, or my understanding of how it should work is wrong.

Anyway, after reading the referenced article I think that the failover mechanism in general is fine, but we should be able to trigger it as soon as possible, in my case no later than a second after a node goes down.

Automatic failover is possible now, but the cluster waits at least 30 seconds before doing it. My use case does not allow 30 seconds of unavailability, so if we could reduce that timeout to around 1 second, it would be great!

Is it possible to reduce the 30-second minimum time for the automatic failover?
I know that the configuration does not allow less than 30 seconds, so I guess some other mechanism, or a tuned version of the current implementation, would be needed in the DB to support faster failover.

The failover could be made quicker than the minimum 30 seconds, but that would mean scripting it in an external monitoring system that you trust to detect quickly (and with a low rate of false positives) that a node is down.
This is not a concern that the client can handle…
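
For reference, such a watchdog would trigger the hard failover itself through the REST API, along these lines (a sketch; host, credentials and the otpNode name of the downed node are placeholders, and the detection logic is entirely up to the monitoring system):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class FailOverDownedNode {
        public static void main(String[] args) throws Exception {
            // Hard-failover the node the monitoring system decided is dead.
            URL url = new URL("http://node1:8091/controller/failOver");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            String auth = Base64.getEncoder()
                    .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setDoOutput(true);
            byte[] body = "otpNode=ns_1@node5".getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            System.out.println("failover returned HTTP " + conn.getResponseCode());
        }
    }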

For getFromReplica, you can and should use ReplicaMode.ALL. This way, if a replica has been promoted via failover but the cluster has not been rebalanced yet, you won’t explicitly target a replica that temporarily doesn’t exist anymore.
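
In code the difference is small (a fragment; bucket and id as in the earlier sketches):

    // Ask for every reachable copy and use whichever comes back, instead of
    // targeting a fixed replica index that may have been promoted away:
    List<JsonDocument> copies = bucket.getFromReplica(id, ReplicaMode.ALL);
    if (!copies.isEmpty()) {
        JsonDocument doc = copies.get(0); // any reachable copy will do
    }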

For writing: when a node has been failed over, you won’t have your required number of replicas available until you rebalance. If you try to write with ReplicateTo.TWO during this period, it will fail because there aren’t enough replicas. Maybe you could isolate this part and fall back to ReplicateTo.ONE during the failover -> rebalance period? @daschl any other idea?
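
Sketched out, such a fallback could look like this (whether the temporarily lower durability requirement is acceptable is an application-level decision):

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.ReplicateTo;
    import com.couchbase.client.java.document.JsonDocument;
    import com.couchbase.client.java.error.DurabilityException;

    public class DegradedWrites {
        // Require two replica copies normally, but accept one during the
        // failover -> rebalance window.
        public static void upsertWithFallback(Bucket bucket, JsonDocument doc) {
            try {
                bucket.upsert(doc, ReplicateTo.TWO);
            } catch (DurabilityException e) {
                bucket.upsert(doc, ReplicateTo.ONE);
            }
        }
    }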

I have the same issue. As you said, we need to rebalance after one node goes down. My question is: how can I rebalance programmatically?

You probably don’t want to rebalance programmatically. It’s possible from the REST API, but you will probably want to take action that considers the nature of the failure. As mentioned in the thread above, your application code can certainly change behavior if your app-specific logic allows. The client can’t, for instance, actually do a ReplicateTo.ONE if there’s no other place to replicate to for the time being.
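
For completeness, the rebalance endpoint follows the same HTTP pattern as the failover sketch earlier in the thread, only the URL and body change (a fragment; the otpNode names are placeholders):

    // POST to /controller/rebalance instead of /controller/failOver.
    URL url = new URL("http://node1:8091/controller/rebalance");
    // knownNodes lists the otpNode names of all nodes known to the cluster;
    // ejectedNodes lists nodes to remove (left empty here, as a failed-over
    // node is dropped by the rebalance automatically).
    byte[] body = ("knownNodes=ns_1@node1,ns_1@node2,ns_1@node3,ns_1@node4"
            + "&ejectedNodes=").getBytes(StandardCharsets.UTF_8);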

Hello,
In production we may not rebalance as soon as a node fails, or it may take some time; during this time the Java client is not responding. How can we handle this case?
I am using java-client 1.4.12 with 2 nodes.

@nitinvavdiya please see the other thread you commented on for more answers on this topic.