Couchbase.Core.Retry.BestEffortRetryStrategy Timeout Error

Hi,

We seem to be getting intermittent timeout errors when using SDK 3.2.8 for KV Gets (ICouchbaseCollection.GetAsync), but aren’t sure whether these are caused by a change in the SDK or by our environment having changed or become slower.

From the logs:

  • timeout value = 2.5s (00:00:02.5000000)
  • When timeout occurs, Couchbase.Core.Retry.BestEffortRetryStrategy kicks in and retries up to 6 times
  • After failing the 6th retry, a Couchbase.Core.Exceptions.UnambiguousTimeoutException is thrown

Questions:

  • Has anything changed with regard to KV Get timeouts and/or retries, either from SDK 2.X to SDK 3.X or between minor versions (SDK 3.2.X to SDK 3.2.Y)? For example: the default timeout value, whether timeouts are enabled or disabled, the retry strategy, etc.
  • Is the 2.5s default timeout value the same as in SDK 2.X? In SDK 3.2.8, it seems to come from:
    public TimeSpan KvTimeout { get; set; } = TimeSpan.FromSeconds(2.5);
  • If we wanted to change or disable the retries, how would we go about that?

Thanks for any help on this topic.

PS: Ideally, I think we would like things to behave as they did previously in SDK 2.X.

Are you certain that the inner problem is a timeout? The retry pattern shouldn’t be based on getting a timeout; the timeout is an outer wrapper around the retries. So, while it may retry, it should fail with a TimeoutException (either Ambiguous or Unambiguous) after 2.5 seconds. The only errors that are retried are those where the matching CouchbaseException implements the IRetryable interface.

However, to answer your question there was a significant rewrite around retries between 2.x and 3.x, though I can’t personally speak to all the specifics about how the overall behavior changed.

Within the 3.2.x minor releases there have been some changes to retries, but nothing that should have significantly affected behaviors. Just some tweaks around things like scope/collection ID refreshes and reusing CancellationTokenSource in .NET 6 and other improvements to reduce heap allocations.
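
To illustrate the earlier point about the timeout being an outer wrapper around the retries, here is a minimal, self-contained sketch. It is not the SDK's implementation: RetryableException and TimeoutWrapperSketch.RunAsync are hypothetical names standing in for exceptions that implement IRetryable and for the retry orchestrator. It only shows the general shape, assuming a single deadline shared by every attempt, so the caller sees a timeout after the configured interval regardless of how many retries happened inside it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an SDK exception that implements IRetryable.
public sealed class RetryableException : Exception { }

public static class TimeoutWrapperSketch
{
    // Runs 'attempt' repeatedly while it throws RetryableException.
    // A single CancellationTokenSource bounds ALL attempts, not each one.
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> attempt, TimeSpan timeout, TimeSpan backoff)
    {
        using var cts = new CancellationTokenSource(timeout); // the outer deadline
        var attempts = 0;
        try
        {
            while (true)
            {
                attempts++;
                try
                {
                    return await attempt(cts.Token).ConfigureAwait(false);
                }
                catch (RetryableException)
                {
                    // Wait before the next attempt; throws once the deadline passes.
                    await Task.Delay(backoff, cts.Token).ConfigureAwait(false);
                }
                // Any other exception is not retried and propagates to the caller.
            }
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // The deadline expired, either during an attempt or during the backoff.
            throw new TimeoutException(
                $"Timed out after {timeout}; attempts made: {attempts}.");
        }
    }
}
```

In this model the retries never extend the deadline, and an attempt that ignores the token it is given would not be interrupted; the sketch assumes cooperative cancellation.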

My apologies, I missed a couple of your other questions:

Np. Thanks for the replies.

We’re not entirely sure whether the issue came from a change in the timeout/retry behaviour or from something else. Given that the default KV timeout stayed the same in SDK 2.X and SDK 3.X at 2.5s, it probably points to something else?

We have custom retry logic, so we’ll change the retry strategy to a class based on FailFastRetryStrategy to keep the behaviour consistent with how it worked before.

I do recall there were some timeout-related issues fixed in the later SDK 2.X versions, dealing with incorrect timeout values being used (I can’t recall whether that applied to KV). I wonder if the observed issues come from us previously using an older SDK 2.X, which in turn incorrectly had timeouts larger than 2.5s?

Oh, found the fixed issue (fixed in 2.7.24):

https://review.couchbase.org/c/couchbase-net-client/+/148125

@obawin

I just found out there is a bug in the handling of the exponential backoff for retries that might explain some of what you’ve been seeing.

https://issues.couchbase.com/browse/NCBC-3176

This has been fixed for the next release of the SDK.

Excellent! Thank you @btburnett3 (and @Richard_Ponton for the fix)!

hello @btburnett3 ,

I was looking at the CB SDK code, and see that the RetryOrchestrator has this:

    var strategy = request.RetryStrategy;
    var action = strategy.RetryAfter(request, reason);
    if (action.Retry)
    {
        ...
    }

To disable CB SDK retries, I created a copy of FailFastRetryStrategy and made that the retry strategy:

internal class FailFastRetryStrategy : IRetryStrategy
{
    public RetryAction RetryAfter(IRequest request, RetryReason reason)
    {
        return RetryAction.Duration(null);
    }
}

However, I noticed that the RetryAction class has the following:

    public bool Retry => DurationValue.HasValue;

    public RetryAction(TimeSpan? duration)
    {
        DurationValue = duration;
    }

Given the above pieces of code, it looks like this won’t cause retries to be “disabled” and will instead cause retries to run continuously? Did I analyze things correctly?

UPDATE: Actually, I noticed the above behaviour is only for this method:
public async Task<T> RetryAsync<T>(Func<Task<T>> send, IRequest request) where T : IServiceResult

The other Retry method in the class does indeed fail fast (i.e. disable retries):
public async Task RetryAsync(BucketBase bucket, IOperation operation, CancellationTokenPair tokenPair = default)

It seems the 2nd is for KV operations and the 1st is for things like N1QL, etc.? Does the 1st need to be changed/fixed?

Also:

  • would it make sense to add a cluster level “Disable Retries” config, and if false, it would check / throw the original exception before the “if (reason.AlwaysRetry())” call in the RetryOrchestrator?
  • have a map from RetryReason to CouchbaseException, so that a relevant exception type would be thrown in the non-KV retry orchestrator method when retries are disabled or when it is a fast retry failure?

@obawin

Perhaps I’m missing something, but I don’t see the problem in that routine? If you’re on the fail-fast strategy it will return an action with a null duration. Then Retry will be false, so it won’t retry.

Thanks for looking. Yes, it’s possible there’s no problem and I misanalyzed it (that’s why a second pair of eyes is good!).

Here’s the code flow of interest:

public async Task<T> RetryAsync<T>(Func<Task<T>> send, IRequest request) where T : IServiceResult
{
    do
    {
        try
        {
            var result = await send().ConfigureAwait(false);
            var reason = result.RetryReason;
            if (reason == RetryReason.NoRetry) return result;

            ....

            var strategy = request.RetryStrategy;
            var action = strategy.RetryAfter(request, reason);
            if (action.Retry)
            {
                ...
            }

            //**** FAIL FAST WILL HAVE action.Retry=FALSE, WHICH CAN CAUSE IT TO LOOP INDEFINITELY IF THE SAME ERROR/RETRY REASON OCCURS?
        }
        catch (TaskCanceledException _)
        {
            ...
        }
    } while (true);
}
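
To make the concern concrete, here is a minimal, self-contained model of the loop shape quoted above. It is only a sketch: RetryReason, RetryAction, Result and LoopSketch are simplified, hypothetical stand-ins rather than the SDK's types, and the elided parts of the real method may already handle this case. It shows that when the strategy returns an action with Retry == false, the loop needs an explicit exit (here, a throw); without one, control falls through to the end of the loop body and send() is called again.

```csharp
using System;
using System.Threading.Tasks;

// Simplified, hypothetical stand-ins for the SDK types quoted in this thread.
public enum RetryReason { NoRetry, SomeFailure }

public readonly struct RetryAction
{
    public RetryAction(TimeSpan? duration) { DurationValue = duration; }
    public TimeSpan? DurationValue { get; }
    public bool Retry => DurationValue.HasValue; // null duration means "do not retry"
}

public sealed class Result
{
    public RetryReason RetryReason { get; set; }
}

public static class LoopSketch
{
    public static async Task<Result> RetryAsync(
        Func<Task<Result>> send, Func<RetryReason, RetryAction> retryAfter)
    {
        while (true)
        {
            var result = await send().ConfigureAwait(false);
            if (result.RetryReason == RetryReason.NoRetry) return result;

            var action = retryAfter(result.RetryReason);
            if (action.Retry)
            {
                // The strategy asked for a retry: back off, then loop again.
                await Task.Delay(action.DurationValue.Value).ConfigureAwait(false);
                continue;
            }

            // Fail-fast path: leave the loop explicitly. If this throw were
            // missing, the loop would silently call send() again.
            throw new InvalidOperationException(
                $"Not retrying after {result.RetryReason}.");
        }
    }
}
```

With a fail-fast delegate such as `_ => new RetryAction(null)`, this version calls send() exactly once and then throws; removing the final throw is what would turn the same input into an endless loop.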

Hi @btburnett3 and @Richard_Ponton, just wanted to check with you regarding the use of the FailFast retry strategy in N1QL queries.

Is the above assessment correct that it could be an issue (infinite retries in some situations)? If so, do you wish for me to file an issue ticket for tracking?

Regards.

hi @btburnett3,

I have exactly the same issue as @obawin. I used the latest NuGet version, 3.2.9, and am still seeing the timeout error.

Sample error message is: The operation /54379 timed out after 00:00:02.5000000. It was retried 1 times using Couchbase.Core.Retry.BestEffortRetryStrategy.

Further to this, when the timeout exception is thrown, nothing in the Couchbase dashboard tells me how many of them have occurred. I’m referring to the below in the dashboard.

Looking at the RetryOrchestrator.cs method public async Task RetryAsync(BucketBase bucket, IOperation operation, CancellationTokenPair tokenPair = default)… One scenario in which it throws the OperationCanceledException is when tokenPair.ThrowIfCancellationRequested(); throws. What could be the reasons for this cancellation being requested, or for the operation being stopped?

@eldridge.martinez

A timeout can happen for any number of reasons. The fact the retry count is 1 makes me think it timed out before leaving the client.

I would enable logging and see what’s happening internally; also, create a new topic as this topic is resolved.

Jeff

Hi @Richard_Ponton,

I created an issue for the potential FailFast infinite loop problem:
https://issues.couchbase.com/browse/NCBC-3197 - “FailFast Retry Strategy May Result in Infinite Processing Loop for Query, Views, Analytics, Search requests”

If it’s not an actual issue, please feel free to close it.

I also have an implementation that I think should resolve the issue. I’ll try to get the code into a review cycle for you to look at sometime today (I can’t recall the process involved).

Thanks.

Review is here: https://review.couchbase.org/c/couchbase-net-client/+/174537
