SDK 2.7 returns OperationTimeout error status during server graceful failover or cluster changes

ccmobile · May 6, 2020, 9:11pm

I am testing .NET SDK version 2.7.16 against Couchbase Server Enterprise 6.5.1 with 2 nodes running in Docker. I followed the getting-started-docker guide for the Docker setup and used the official image.

The test app continuously performs a simple Get operation against a Couchbase bucket. The issue I see is that when nodes are added to or removed from the cluster (with or without Graceful Failover), the client randomly fails with OperationTimeout status while the change is in progress. The errors stop once the cluster change has completed. I was expecting the SDK to handle the bucket remapping gracefully as the cluster changes. From the application's perspective, the Get operation should succeed without any special handling in this scenario.

I debugged the SDK code on GitHub, and there seems to be a mismatch in the unit of the timeout value: seconds vs. milliseconds.
In CouchbaseRequestExecuter.SendWithRetry (/release27/Src/Couchbase/Core/Buckets/CouchbaseRequestExecuter.cs#L596):

if (CanRetryOperation(operationResult, operation) && !operation.TimedOut())

The operation can be expected to fail with a NOT_MY_VBUCKET error while the cluster map changes. However, operation.TimedOut() should return false in that case so the operation can be retried using the updated map.

I found that Operation.TimedOut
(/release27/Src/Couchbase/IO/Operations/OperationBase.cs#L463) compares against the elapsed time in milliseconds.

var elasped = DateTime.UtcNow.Subtract(CreationTime).TotalMilliseconds;
if (elasped >= Timeout || (ErrorCode != null && ErrorCode.HasTimedOut(elasped)))

But the timeout value passed in by CouchbaseBucket.Get
(/release27/Src/Couchbase/CouchbaseBucket.cs#L974) is in seconds.

var operation = new Get(key, null, _transcoder, timeout.GetSeconds())

Because of this unit mismatch, the elapsed milliseconds exceed the timeout value (expressed in seconds) almost immediately, so nearly any underlying error is treated as timed out and the status is overwritten at /release27/Src/Couchbase/Core/Buckets/CouchbaseRequestExecuter.cs#L618.

((OperationResult)operationResult).Status = ResponseStatus.OperationTimeout;
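To make the suspected mismatch concrete, here is a minimal standalone sketch. It does not use the SDK: `FakeOperation` and `UnitMismatchDemo` are hypothetical names, the 2.5 second timeout is an assumed value, and a plain cast of `TotalSeconds` stands in for whatever `GetSeconds()` does. Only the `elapsed >= Timeout` half of the quoted check is modeled.

```csharp
using System;

// Hypothetical stand-in for the SDK operation; models only the elapsed >= Timeout check.
public sealed class FakeOperation
{
    public DateTime CreationTime { get; }
    public uint Timeout { get; } // unit is whatever the caller passed in

    public FakeOperation(uint timeout, DateTime creationTime)
    {
        Timeout = timeout;
        CreationTime = creationTime;
    }

    // Elapsed time is measured in milliseconds, as in the quoted snippet.
    public bool TimedOut(DateTime now)
    {
        var elapsed = now.Subtract(CreationTime).TotalMilliseconds;
        return elapsed >= Timeout;
    }
}

public static class UnitMismatchDemo
{
    public static void Main()
    {
        var created = new DateTime(2020, 5, 6, 0, 0, 0, DateTimeKind.Utc);
        var now = created.AddMilliseconds(50);       // only 50 ms have passed
        var configured = TimeSpan.FromSeconds(2.5);  // assumed operation timeout

        // Timeout passed as seconds (2) but compared against milliseconds.
        var asSeconds = new FakeOperation((uint)configured.TotalSeconds, created);
        // Timeout passed as milliseconds (2500), matching the comparison.
        var asMillis = new FakeOperation((uint)configured.TotalMilliseconds, created);

        Console.WriteLine(asSeconds.TimedOut(now)); // True: 50 >= 2
        Console.WriteLine(asMillis.TimedOut(now));  // False: 50 < 2500
    }
}
```

If the SDK behaves like the first case, an operation that receives NOT_MY_VBUCKET a few milliseconds after creation would already count as timed out and would never reach the retry path.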

Could someone confirm this finding or point out what I am missing? Thanks.

PS: Sorry I couldn't post proper links above. The paths are relative to the release27 branch on GitHub.

jmorris · May 7, 2020, 3:56am

@ccmobile -

Thanks for the excellent post! I’ll look into this and get back to you!

-Jeff

dpupek · July 1, 2020, 8:53pm

I have been able to reproduce the same results. Is there a patch for this? In our scenario this is pretty much a showstopper: restarting any node causes our application to start receiving timeouts. The only remedy is to recycle the application (ASP.NET). We are considering detecting the timeout, resetting the client, and then retrying. This is officially frowned upon but may be our only option.
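A rough sketch of the detect, reset, and retry idea, written without any SDK types so it stays generic. `TimeoutRetry.Execute` and its parameters are hypothetical names; how the client is actually reset is left to the caller and is not something this thread specifies.

```csharp
using System;

public static class TimeoutRetry
{
    // Runs op once. While the result looks like a timeout and attempts remain,
    // calls reset and runs op again. Returns the last result either way.
    public static T Execute<T>(Func<T> op, Func<T, bool> isTimeout, Action reset, int maxAttempts)
    {
        if (op == null) throw new ArgumentNullException(nameof(op));
        if (isTimeout == null) throw new ArgumentNullException(nameof(isTimeout));
        if (reset == null) throw new ArgumentNullException(nameof(reset));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        T result = op();
        for (var attempt = 1; attempt < maxAttempts && isTimeout(result); attempt++)
        {
            reset();       // caller-supplied: e.g. rebuild the client or bucket
            result = op(); // try again with the fresh client
        }
        return result;
    }
}
```

With the 2.x SDK, the `isTimeout` callback would presumably compare the result's Status to ResponseStatus.OperationTimeout. Keeping `maxAttempts` small matters here, since each reset is expensive and a real timeout should still surface to the caller.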

dpupek · July 6, 2020, 6:07pm

Has there been any word on this? I suspect that much of the time we are spending on timeout issues and workarounds traces back to this.
