Couchbase.Core.Retry.BestEffortRetryStrategy Timeout Error

obawin · March 28, 2022, 8:30pm

Hi,

We seem to be getting intermittent timeout errors when using SDK 3.2.8 for KV Gets (ICouchbaseCollection.GetAsync), but aren’t sure whether these are caused by a change in the SDK or by our environment having changed or become slower.

From the logs:

timeout value = 2.5s (00:00:02.5000000)
When timeout occurs, Couchbase.Core.Retry.BestEffortRetryStrategy kicks in and retries up to 6 times
After failing the 6th retry, a Couchbase.Core.Exceptions.UnambiguousTimeoutException is thrown

Questions:

Has anything changed with regard to KV Get timeouts and/or retries, either from SDK 2.X to SDK 3.X or between the minor versions SDK 3.2.X and SDK 3.2.Y? Examples: the default timeout value, whether timeouts are enabled or disabled, the retry strategy, etc.
Is 2.5s default timeout value the same as SDK 2.X? In SDK 3.2.8, it seems to come from:
    public TimeSpan KvTimeout { get; set; } = TimeSpan.FromSeconds(2.5);
If we wanted to change or disable the retries, how would we go about that?

Thanks for any help on this topic.

PS: I think ideally, we would like things to behave as it did previously in SDK 2.X.

btburnett3 · March 28, 2022, 9:02pm

Are you certain that the inner problem is a timeout? The retry pattern shouldn’t be based on getting a timeout; the timeout is an outer wrapper around the retries. So, while it may retry, it should fail with a TimeoutException (either Ambiguous or Unambiguous) after 2.5 seconds. The only errors that are retried are those where the matching CouchbaseException implements the IRetryable interface.
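
A minimal sketch of the "timeout as an outer wrapper around the retries" idea described above. It does not use the SDK; TimeoutWrapperSketch, RunAsync, and their parameters are hypothetical names chosen for illustration, and the real RetryOrchestrator is more involved. The point is only that one deadline covers all attempts, so the caller sees a timeout once the overall budget is spent, however many retries fit inside it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in, not an SDK type: one deadline wraps every attempt.
public static class TimeoutWrapperSketch
{
    // Calls `attempt` until it returns true or the overall timeout elapses.
    // Returns the number of attempts made; throws TimeoutException on expiry.
    public static async Task<int> RunAsync(
        Func<int, Task<bool>> attempt, TimeSpan timeout, TimeSpan backoff)
    {
        // A single CancellationTokenSource carries the deadline for all attempts.
        using var cts = new CancellationTokenSource(timeout);
        var attempts = 0;
        try
        {
            while (true)
            {
                cts.Token.ThrowIfCancellationRequested();
                attempts++;
                if (await attempt(attempts).ConfigureAwait(false)) return attempts;

                // The backoff wait is also bounded by the same deadline.
                await Task.Delay(backoff, cts.Token).ConfigureAwait(false);
            }
        }
        catch (OperationCanceledException)
        {
            throw new TimeoutException(
                $"Timed out after {timeout}; attempted {attempts} times.");
        }
    }
}
```

With this shape, an operation that keeps failing with a retryable error is retried until the deadline passes and then surfaces as a timeout, which would match an error message that reports both a timeout and a retry count.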

However, to answer your question there was a significant rewrite around retries between 2.x and 3.x, though I can’t personally speak to all the specifics about how the overall behavior changed.

Within the 3.2.x minor releases there have been some changes to retries, but nothing that should have significantly affected behaviors. Just some tweaks around things like scope/collection ID refreshes and reusing CancellationTokenSource in .NET 6 and other improvements to reduce heap allocations.

btburnett3 · March 28, 2022, 9:08pm

My apologies, I missed a couple of your other questions:

The default OperationLifespan in SDK 2.0 was also 2.5 seconds
You can supply a different retry strategy using ClusterOptions.WithRetryStrategy. To disable retries, use the FailFastRetryStrategy. However, I see that it is an internal type; I’m not sure why. You can just copy that logic locally: couchbase-net-client/FailFastRetryStrategy.cs at master · couchbase/couchbase-net-client · GitHub

obawin · March 28, 2022, 9:24pm

Np. Thanks for the replies.

We’re not entirely sure whether the issue was from a change in the timeout/retry or something else. Given that the default KV timeout stayed the same in SDK 2.X and SDK 3.X at 2.5s, it points to probably something else?

We have custom retry logic, so changing the retry strategy to a class created from FailFastRetryStrategy is something we’ll do to keep it consistent for how it worked before.

I do recall there were some timeout-related issues fixed in the later SDK 2.X versions dealing with incorrect timeout values being used (I can’t recall whether that applied to KV). I wonder if the observed issues come from us previously using an older SDK 2.X, which in turn incorrectly had timeouts larger than 2.5s?

obawin · March 28, 2022, 9:27pm

Oh, found the fixed issue (fixed in 2.7.24):

https://review.couchbase.org/c/couchbase-net-client/+/148125

btburnett3 · March 30, 2022, 12:09pm

@obawin

I just found out there is a bug in the handling of the exponential backoff for retries that might explain some of what you’ve been seeing.

https://issues.couchbase.com/browse/NCBC-3176

This has been fixed for the next release of the SDK.

obawin · March 30, 2022, 4:34pm

Excellent! Thank you @btburnett3 (and @Richard_Ponton for the fix)!

obawin · April 11, 2022, 6:44pm

hello @btburnett3 ,

I was looking at the CB SDK code, and see that the RetryOrchestrator has this:

                    var strategy = request.RetryStrategy;
                    var action = strategy.RetryAfter(request, reason);
                    if (action.Retry)
                    {
                            ...
                    }

To disable CB SDK retries, I created a copy of FailFastRetryStrategy and made that the retry strategy:

internal class FailFastRetryStrategy : IRetryStrategy
{
    public RetryAction RetryAfter(IRequest request, RetryReason reason)
    {
        return RetryAction.Duration(null);
    }
}

However, I noticed that the RetryAction class has the following:

    public bool Retry => DurationValue.HasValue;

    public RetryAction(TimeSpan? duration)
    {
        DurationValue = duration;
    }

Given the above pieces of code, it looks like this won’t cause retries to be “disabled” and will instead cause retries to run continuously? Did I analyze things correctly?

UPDATE: Actually, I noticed the above behaviour is only for this method:
public async Task<T> RetryAsync<T>(Func<Task<T>> send, IRequest request) where T : IServiceResult

The other Retry method in the class does indeed fail fast (i.e. disable retries):
public async Task RetryAsync(BucketBase bucket, IOperation operation, CancellationTokenPair tokenPair = default)

It seems the second is for KV operations and the first is for things like N1QL, etc.? Does the first need to be changed/fixed?

Also:

Would it make sense to add a cluster-level “Disable Retries” config that, when enabled, checks for and throws the original exception before the “if (reason.AlwaysRetry())” call in the RetryOrchestrator?
Would it make sense to have a map from RetryReason to CouchbaseException, so that a relevant exception type is thrown in the non-KV retry orchestrator method when retries are disabled or when it is a fail-fast failure?

btburnett3 · April 12, 2022, 3:13pm

@obawin

Perhaps I’m missing something, but I don’t see the problem in that routine? If you’re on the fail-fast strategy it will return an action with a null duration. Then Retry will be false, so it won’t retry.
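
The reasoning above can be checked with a small self-contained sketch. RetryActionSketch and FailFastSketch are local stand-ins that mirror the snippets quoted earlier in this thread (the Retry property, the constructor, and the Duration(null) call); they are not the SDK's own types, and the static Duration factory body is an assumption about how it wraps the constructor.

```csharp
using System;

// Local stand-in mirroring the RetryAction snippet quoted in this thread.
public readonly struct RetryActionSketch
{
    public TimeSpan? DurationValue { get; }

    // Retry is true only when a backoff duration was supplied.
    public bool Retry => DurationValue.HasValue;

    public RetryActionSketch(TimeSpan? duration)
    {
        DurationValue = duration;
    }

    // Assumed shape of the factory used by the fail-fast strategy.
    public static RetryActionSketch Duration(TimeSpan? duration) =>
        new RetryActionSketch(duration);
}

// Local stand-in for the copied fail-fast strategy: always a null duration.
public static class FailFastSketch
{
    public static RetryActionSketch RetryAfter() => RetryActionSketch.Duration(null);
}
```

Under these assumptions, FailFastSketch.RetryAfter().Retry evaluates to false, while an action created with a non-null duration reports Retry as true. Whether a false value actually ends the surrounding loop depends on the caller, which is a separate question.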

obawin · April 12, 2022, 4:11pm

Thanks for looking. Yes, it’s possible there’s no problem and I misanalyzed it (that’s why a second pair of eyes is good!).

Here’s the code flow of interest:

public async Task<T> RetryAsync<T>(Func<Task<T>> send, IRequest request) where T : IServiceResult
{
…

            do
            {
                try
                {
                    var result = await send().ConfigureAwait(false);
                    var reason = result.RetryReason;
                    if (reason == RetryReason.NoRetry) return result;

                    ....

                    var strategy = request.RetryStrategy;
                    var action = strategy.RetryAfter(request, reason);
                    if (action.Retry)
                    {
                        ...
                    }

                    //**** FAIL FAST WILL HAVE action.Retry=FALSE, WHICH CAN CAUSE IT TO LOOP INDEFINITELY IF THE SAME ERROR/RETRY REASON OCCURS?
                }
                catch (TaskCanceledException _)
                {
                    ...
                }
            } while (true);

}
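
The concern raised in the code flow above can be modeled in isolation. This is a simplified sketch, not the SDK's code: RetryLoopSketch and its parameters are hypothetical, the elided parts of the real method are not reproduced (they may well contain exits this model lacks), and a maxIterations cap is added so the demonstration cannot spin forever. RunWithoutExit mimics a do/while(true) loop with no exit when the strategy declines to retry; RunWithExit shows one possible remedy, throwing as soon as the strategy says no.

```csharp
using System;

// Hypothetical model of the loop shape under discussion; none of these are SDK types.
public static class RetryLoopSketch
{
    // sendFails: true when the simulated request fails with a retryable reason.
    // shouldRetry: the strategy's decision (a fail-fast strategy always returns false).
    // Returns the number of loop iterations executed.
    public static int RunWithoutExit(Func<bool> sendFails, Func<bool> shouldRetry, int maxIterations)
    {
        var iterations = 0;
        do
        {
            iterations++;
            if (!sendFails()) return iterations; // success leaves the loop
            if (shouldRetry())
            {
                // back off before the next attempt (elided in this model)
            }
            // Nothing exits here when shouldRetry() is false, so the loop repeats.
        } while (iterations < maxIterations);
        return iterations;
    }

    // Same loop, but a declined retry ends processing immediately.
    public static int RunWithExit(Func<bool> sendFails, Func<bool> shouldRetry, int maxIterations)
    {
        var iterations = 0;
        do
        {
            iterations++;
            if (!sendFails()) return iterations;
            if (!shouldRetry())
            {
                throw new InvalidOperationException("Retry declined by strategy.");
            }
        } while (iterations < maxIterations);
        return iterations;
    }
}
```

In this model, a request that always fails combined with a strategy that never retries runs RunWithoutExit up to the cap, whereas RunWithExit throws on the first iteration. A real fix would presumably throw an exception type that reflects the retry reason rather than InvalidOperationException.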

obawin · April 26, 2022, 2:56am

Hi @btburnett3 and @Richard_Ponton, just wanted to check with you regarding the use of the FailFast retry strategy in N1QL queries.

Is the above assessment correct that it could be an issue (infinite retries in some situations)? If so, do you wish for me to file an issue ticket for tracking?

Regards.

eldridge.martinez · April 27, 2022, 3:26am

hi @btburnett3,

I have exactly the same issue as @obawin. I used the latest NuGet version, 3.2.9, and am still seeing the timeout error.

Sample error message is: The operation /54379 timed out after 00:00:02.5000000. It was retried 1 times using Couchbase.Core.Retry.BestEffortRetryStrategy.

eldridge.martinez · April 27, 2022, 4:21am

Further to this, when a timeout exception is thrown, nothing in the Couchbase dashboard tells me how many of them have occurred. I’m referring to the below in the dashboard.

Looking at the method public async Task RetryAsync(BucketBase bucket, IOperation operation, CancellationTokenPair tokenPair = default) in RetryOrchestrator.cs: one scenario in which it throws an OperationCanceledException is when tokenPair.ThrowIfCancellationRequested() throws. What could cause this cancellation to be requested, or the operation to be stopped?

jmorris · April 27, 2022, 3:45pm

@eldridge.martinez

A timeout can happen for any number of reasons. The fact the retry count is 1 makes me think it timed out before leaving the client.

I would enable logging and see what’s happening internally; also, create a new topic as this topic is resolved.

Jeff

obawin · May 5, 2022, 4:49pm

Hi @Richard_Ponton,

I created an issue for the potential FailFast infinite loop problem:
“FailFast Retry Strategy May Result in Infinite Processing Loop for Query, Views, Analytics, Search requests”

If it’s not an actual issue, please feel free to close it.

I also have an implementation that I think should resolve the issue. I’ll try getting the code into a review cycle for you to look at today sometime (can’t recall the process involved).

Thanks.

Review is here: https://review.couchbase.org/c/couchbase-net-client/+/174537
