SDK 2.7 returns OperationTimeout error status during server graceful failover or cluster changes

I am testing .NET SDK version 2.7.16 connecting to Couchbase Server Enterprise 6.5.1 with 2 nodes running in Docker. I used the getting-started-docker guide and the official image for the Docker setup.

The test app continuously performs a simple Get operation against a Couchbase bucket. The issue I see is that when nodes are added to or removed from the cluster (with or without Graceful Failover), the client randomly fails with OperationTimeout status while the change is in progress. The errors stop once the cluster change has completed. I was expecting the SDK to handle vBucket remapping gracefully as the cluster changes. From the application's perspective, the Get operation should succeed in retrieving from the bucket without special handling in this scenario.

I debugged the SDK code on GitHub, and there seems to be a mismatch in the timeout value: seconds vs. milliseconds.
In CouchbaseRequestExecuter.SendWithRetry (/release27/Src/Couchbase/Core/Buckets/CouchbaseRequestExecuter.cs#L596):

if (CanRetryOperation(operationResult, operation) && !operation.TimedOut())

the operation can be expected to fail with NOT_MY_VBUCKET error as the cluster map changes. However, operation.TimedOut() should be returning false so the operation can be retried using the updated map.

I found that Operation.TimedOut (/release27/Src/Couchbase/IO/Operations/OperationBase.cs#L463) compares the elapsed time in milliseconds.

var elasped = DateTime.UtcNow.Subtract(CreationTime).TotalMilliseconds;
if (elasped >= Timeout || (ErrorCode != null && ErrorCode.HasTimedOut(elasped)))

But the timeout value is passed in by CouchbaseBucket.Get (/release27/Src/Couchbase/CouchbaseBucket.cs#L974) in seconds.

var operation = new Get(key, null, _transcoder, timeout.GetSeconds())

Because of this mismatch in time units, almost any underlying error would be considered timed out, and the status is then overwritten at /release27/Src/Couchbase/Core/Buckets/CouchbaseRequestExecuter.cs#L618.

((OperationResult)operationResult).Status = ResponseStatus.OperationTimeout;
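
To make the suspected mismatch concrete, here is a minimal standalone sketch that does not use the SDK. The class TimeoutUnitDemo, its TimedOut helper, the 2.5 second timeout and the 50 ms elapsed time are all assumptions for illustration; only the millisecond comparison is modeled on the line quoted above. It shows that a timeout expressed in seconds, when compared against elapsed milliseconds, reports a timeout almost immediately.

```csharp
using System;

// Hypothetical demo class, not part of the Couchbase SDK.
public static class TimeoutUnitDemo
{
    // Modeled on the quoted comparison: elapsed time is measured in milliseconds
    // and compared directly against the numeric timeout value.
    public static bool TimedOut(DateTime creationTimeUtc, DateTime nowUtc, uint timeout)
    {
        var elapsedMs = nowUtc.Subtract(creationTimeUtc).TotalMilliseconds;
        return elapsedMs >= timeout;
    }

    public static void Main()
    {
        // Assumed scenario: the first attempt fails 50 ms after the operation was created.
        var created = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        var now = created.AddMilliseconds(50);

        // Assumed operation timeout of 2.5 seconds.
        var timeout = TimeSpan.FromSeconds(2.5);
        uint asSeconds = (uint)timeout.TotalSeconds;            // 2
        uint asMilliseconds = (uint)timeout.TotalMilliseconds;  // 2500

        // Timeout given in seconds: 50 >= 2, so the operation looks timed out.
        Console.WriteLine(TimedOut(created, now, asSeconds));       // True
        // Timeout given in milliseconds: 50 >= 2500 is false, so a retry would be allowed.
        Console.WriteLine(TimedOut(created, now, asMilliseconds));  // False
    }
}
```

If the real code path behaves like the first call, a retriable NOT_MY_VBUCKET response would never be retried, which would match the OperationTimeout status I am seeing.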

Could someone confirm my above finding or suggest otherwise? Thanks.

PS sorry I couldn’t post the proper links above. They are relative to the release27 branch in github.

@ccmobile -

Thanks for the excellent post! I’ll look into this and get back to you!

-Jeff

I have been able to reproduce the same results. Is there a patch for this? In our scenario this is pretty much a showstopper. Restarting any node causes our application to start receiving timeouts. The only remedy is to recycle the application (ASP.NET). We are considering checking for the timeout, resetting the client, and then retrying. This is officially frowned upon but may be our only option.

Has there been any word on this? I feel like all the time being spent on dealing with timeout issues, and on how to get around them, is related to this.