[JCBC-148] Issue with Observe API Persist.TWO and 1 dead node: Time Out when doing set operation Created: 18/Nov/12  Updated: 03/Dec/12  Resolved: 03/Dec/12

Status: Resolved
Project: Couchbase Java Client
Component/s: Core
Affects Version/s: 1.1-dp4
Fix Version/s: 1.1-beta
Security Level: Public

Type: Bug Priority: Critical
Reporter: Tug Grall (Inactive) Assignee: Michael Nitschinger
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2 nodes cluster on couchbase-server-community_x86_2.0.0-1947-rel
  1 node on Ubuntu (VM)
  1 node on OS X

Bucket configured with 1 replica

Attachments: Zip Archive CouchbaseSamples.zip    

I have a very simple Java program that connects to the 2 nodes and does a set with the following code:

1. First I connect to multiple nodes
            List<URI> couchbaseServerUris = new ArrayList<URI>();
            couchbaseServerUris.add( new URI("") );
            couchbaseServerUris.add( new URI("") );
            CouchbaseClient client = new CouchbaseClient( couchbaseServerUris , "default" , "" );

2. Then I call the set operation

        OperationFuture<Boolean> stored = client.set( "my-dummy-key",0, "{\"name\" : \"foo\", \"title\" : \"bar-test\"}", PersistTo.TWO);

So everything is working as expected when the 2 nodes are up.

When I kill 1 node (for example by disconnecting, stopping, or pausing the Ubuntu VM), I see the following behavior:

When I execute this program:
1- I get an exception saying that 1 node is down. This is expected behavior (although the long stack trace could be avoided):
2012-11-18 08:14:55.830 WARN com.couchbase.client.vbucket.ConfigurationProviderHTTP: Connection problems with URI ...skipping
java.net.ConnectException: Host is down

2- When I do the set, the program blocks until it hits a timeout (the log reports an Observe timeout after 40 seconds):
2012-11-18 08:20:13.462 INFO com.couchbase.client.CouchbaseConnection: Shut down Couchbase client
Error while storing : Observe Timeout - Polled Unsuccessfully for at least 40 seconds.
2012-11-18 08:20:13.466 INFO done : true
done : {OperationStatus success=false: Observe Timeout - Polled Unsuccessfully for at least 40 seconds.}
com.couchbase.client.ViewNode: Couchbase I/O reactor terminated
2012-11-18 08:20:13.467 INFO com.couchbase.client.ViewNode: Couchbase I/O reactor terminated

Note that this only happens with PersistTo.TWO.
If I use PersistTo.MASTER or PersistTo.ONE, the program runs with no error and without blocking.
If I use PersistTo.THREE (or more), the program runs without blocking and returns the expected observe message: "Error while storing : Requested persistence to 3 node(s), but only 2 are available."

Comment by Tug Grall (Inactive) [ 18/Nov/12 ]
Sample program
Comment by Matt Ingenthron [ 20/Nov/12 ]
I do believe that's actually expected behavior, but let's talk through it to get your opinion.

We have a couple of options when an unexpected failure occurs. One is that we try our hardest to complete the requested operation and rely on timeouts to keep from blocking forever. The other is that we keep tabs on our connections and, if a connection is down, fail operations immediately so that application code is not left waiting for something that may or may not succeed.

Had you gone in and removed the second node (click 'remove' and 'rebalance'), then the client should have done something similar to what it did when you requested three nodes. The node failure you describe above, however, is an unexpected one. Further, the client library doesn't really know whether it is temporary or permanent.

Finally, I do want to note, and I think this is well documented, that many operations built on the Observe protocol can end in timeouts. This is not the only one. Generally speaking, application code should be ready to do *something* in the case of a timeout.
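
As an illustration of application code being ready for a timeout, here is a minimal sketch that uses only java.util.concurrent. It assumes the future returned by the client can be awaited like any Future<Boolean>; the TimeoutHandling class, the Outcome enum, and the await method are hypothetical names and not part of the Couchbase client.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: turns a pending store operation into an explicit outcome. */
final class TimeoutHandling {

    enum Outcome { STORED, FAILED, TIMED_OUT }

    /**
     * Waits at most the given time for the store result.
     * The caller decides what to do with TIMED_OUT (retry, log, degrade).
     */
    static Outcome await(Future<Boolean> stored, long timeout, TimeUnit unit) {
        try {
            // Boolean.TRUE.equals also treats a null result as a failure.
            return Boolean.TRUE.equals(stored.get(timeout, unit))
                    ? Outcome.STORED
                    : Outcome.FAILED;
        } catch (TimeoutException e) {
            // Stop waiting; the operation may or may not have succeeded.
            stored.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the calling thread.
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }
}
```

A caller could pass the future from the set call together with its own time budget and then branch on the returned Outcome, for example by retrying with a weaker persistence requirement on TIMED_OUT.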
Comment by Matt Ingenthron [ 20/Nov/12 ]
Tug explained this further. The PersistTo.THREE check must be happening after some operations have already been done, which is a bit late considering this operation can never succeed. The failure should be the same for a cluster that has a down node as it is for a cluster that just doesn't have a primary and two replica locations.
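
The up-front check argued for here could look roughly like the sketch below. It is only an illustration under assumptions: the PersistencePreflight class, its exception type, and the per-node reachability flags are hypothetical and not part of the client. The first error message reuses the wording from the report; the second is invented for the down-node case.

```java
import java.util.List;

/** Hypothetical pre-flight check that runs before any set is sent. */
final class PersistencePreflight {

    /** Raised immediately when the persistence requirement cannot be met. */
    static final class UnsatisfiableDurabilityException extends RuntimeException {
        UnsatisfiableDurabilityException(String message) {
            super(message);
        }
    }

    /**
     * @param requiredNodes nodes that must persist the item (assumed 2 for PersistTo.TWO)
     * @param nodeReachable one flag per configured node, true if currently connected
     */
    static void check(int requiredNodes, List<Boolean> nodeReachable) {
        int configured = nodeReachable.size();
        long reachable = nodeReachable.stream().filter(Boolean::booleanValue).count();

        // Same situation as PersistTo.THREE on a two-node cluster.
        if (requiredNodes > configured) {
            throw new UnsatisfiableDurabilityException(
                "Requested persistence to " + requiredNodes
                    + " node(s), but only " + configured + " are available.");
        }
        // A down node is treated the same way instead of polling until timeout.
        if (requiredNodes > reachable) {
            throw new UnsatisfiableDurabilityException(
                "Requested persistence to " + requiredNodes
                    + " node(s), but only " + reachable + " of " + configured
                    + " are reachable.");
        }
    }
}
```

With two configured nodes and one of them down, check(1, ...) returns normally while check(2, ...) throws at once, which matches the behavior requested for PersistTo.TWO.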
Comment by Mike Wiederhold [ 21/Nov/12 ]
The way Rags wrote this code originally was to do the set and then the observe. The observe part is the part that does all of the checking, so the set will actually go through and then you will get the error. Similarly, there is no checking for downed nodes, and I don't think we actually have the ability to do this at the moment, but I may be wrong.
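
The "set, then observe" ordering described above can be modeled with a short sketch. This is a simplified model and not the client's actual code: the ObservePollSketch class, its parameters, and the polling scheme are assumptions made for illustration.

```java
import java.util.function.IntSupplier;

/** Simplified model of "set first, check persistence afterwards". */
final class ObservePollSketch {

    /**
     * @param set                performs the write (always executed)
     * @param persistedOn        reports on how many nodes the item is persisted
     * @param requiredNodes      nodes required by the persistence setting
     * @param maxPolls           polls before giving up (assumed value)
     * @param pollIntervalMillis pause between polls (assumed value)
     * @return true if the requirement was observed before the polls ran out
     */
    static boolean setThenObserve(Runnable set, IntSupplier persistedOn,
                                  int requiredNodes, int maxPolls,
                                  long pollIntervalMillis) {
        // The write goes through before any persistence check happens.
        set.run();
        for (int poll = 0; poll < maxPolls; poll++) {
            if (persistedOn.getAsInt() >= requiredNodes) {
                return true;
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // With a node down the count never reaches the requirement,
        // so the caller waits for the full polling window.
        return false;
    }
}
```

In this model a down replica means persistedOn never reaches 2, so the call only returns after maxPolls attempts even though the write itself has already been applied.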

On another note, one other thing I think is wrong is returning an OperationFuture from all of the observe functions even though they are not actually asynchronous.
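
To make the distinction concrete, the sketch below contrasts a method that blocks and then hands back an already completed future with one that returns immediately. It uses the JDK's CompletableFuture purely for illustration; the FutureStyles class and both method names are hypothetical and say nothing about how OperationFuture is implemented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Two ways a method can return a future. */
final class FutureStyles {

    /**
     * Runs the work on the caller's thread and wraps the result.
     * The caller has already waited by the time it receives the future.
     */
    static <T> CompletableFuture<T> blockingThenWrap(Supplier<T> work) {
        return CompletableFuture.completedFuture(work.get());
    }

    /**
     * Returns immediately; the work runs on another thread and
     * the caller can wait on the future only when it needs the result.
     */
    static <T> CompletableFuture<T> trulyAsync(Supplier<T> work) {
        return CompletableFuture.supplyAsync(work);
    }
}
```

Both methods have the same return type, so only the second gives the caller a chance to do other work while the operation is in flight.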
Comment by Michael Nitschinger [ 30/Nov/12 ]
Comment by Michael Nitschinger [ 03/Dec/12 ]
Fixed; this will be available in the beta release.
Generated at Thu Apr 24 08:33:48 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.