The operation has timed out

Hi All,
I’ve encountered a systematic operation timeout during our .NET Core 3.1 application lifecycle. We’re using SDK 2.7.20 and Couchbase Enterprise Edition 6.5.0 build 4960 (the cluster is made of just one node). The exception reported is Couchbase.IO.SendTimeoutExpiredException.

I’ve dug a lot through the available documentation and posts. I can confirm that there aren’t any network problems, because I’ve reproduced the same error both on a LAN server and on a localhost Docker container. Currently I have the following configuration (I’ve tried to fix the problem by increasing the pool max size):

var config = new ClientConfiguration(clientSection)
{
    PoolConfiguration = new PoolConfiguration
    {
        MaxSize = 20,
        MinSize = 20,
        SendTimeout = 12000
    },

    // assign one or more Couchbase Server URIs available for bootstrap
    Servers = new List<Uri> {
        new Uri(server)
    },
    UseSsl = false,
    BucketConfigs = new Dictionary<string, BucketConfiguration>
    {
        {bucketName, new BucketConfiguration
            {
                UseEnhancedDurability = true,
                BucketName = bucketName,
                Password = "",
                UseSsl = false
            }
        },
    }
};

// initialize Cluster, to be built as a singleton
ClusterHelper.Initialize(config);

This runs at program startup.

Inside our repository (injected via DI) we get an instance of the bucket using ClusterHelper.GetBucket.
The timeout occurs during a test cycle that consumes the Web API, after 6 documents have been inserted.

Could upgrading the SDK (to version 3) lead to a more stable situation? Do you have any suggestions for the current SDK (2.7.20)?

Thanks for your help,
Dario

@Dario_Mazza

How much load are you putting on the cluster in your test? I would expect that timeout to happen only under a very high level of strain.

Can you show the code that’s actually using the IBucket? Maybe I can spot something there.

1 Like

The load is low; we’re in the middle of the dev phase, so the load is just my integration test activity: ten Web API calls, with two upserts each. The timeout happens after just 2 Web API calls.

Our component stack is very common:
Controller
–> Service
–> Repository

The integration test above uses two different services and repositories starting from the same action in the entry controller.

Every repository inherits from a base repository. The base repository constructor is the following:

public CouchbaseRepository(IConfiguration configuration)
{
    Configuration = configuration;
    string bucketName = Configuration.GetSection("CouchbaseSettings:BucketName").Value;
    Bucket = ClusterHelper.GetBucket(bucketName);
    BucketContext = new BucketContext(Bucket);
}

As you can see, we use GetBucket to obtain the bucket object. Note that the bucketName is always the same, so we use just one bucket.
The service and the repository are injected through .net core DI in the class constructor.
Here is the save method defined in the base repository:

public async Task<T> SaveAsync(T item)
{
    if (item.Id == null)
    {
        item.Created = DateTime.Now;
        // append a counter to the baseId (baseId may be "")
        item.Id = await AppendKeyIncrement();
    }
    else if (!item.Created.HasValue)
        throw new InvalidOperationException("Existing entity must have a creation date");

    item.Updated = DateTime.Now;

    // prefix the id with the type
    var key = ComputeKey(item.Id);

    var result = await Bucket.UpsertAsync(key, item);

    if (!result.Success)
    {
        throw result.Exception;
    }

    return await GetByIdAsync(item.Id);
}

Thanks for the kind help,
Best regards

@Dario_Mazza

Unfortunately, I don’t see anything in that code that jumps out at me as a problem. It looks okay, so the problem must be elsewhere.

That said, I do see a few things that I’d like to point out to you that may help.

  1. If this is a new project, I’d seriously consider using SDK 3.x instead of 2.x. While 2.x is still supported, for a new project it makes sense to start with the latest to avoid costly refactors later. It also has a lot of improvements that make it easier to use in a lot of ways.
  2. I’d look at using this NuGet package to help with your DI, especially the support for named bucket providers. That will cleanup your constructor injection logic on the repository a lot. https://www.nuget.org/packages/Couchbase.Extensions.DependencyInjection/2.0.2 Note: You must use 2.0.2 for 2.7 SDK compatibility.
  3. Your call to throw result.Exception isn’t safe in the 2.x SDK. That value may be null, depending on the type of error that caused the request to fail. Instead, use result.EnsureSuccess(), which will throw a correct exception for you only on failure.
  4. I generally recommend including .ConfigureAwait(false) on all awaits within repositories.
  5. Calling a Get immediately after an Upsert is probably an unnecessary performance penalty, unless you have more advanced logic in GetByIdAsync that you need to be hit.
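Putting points 3–5 together, the earlier SaveAsync could be sketched like this (a sketch only, assuming the same Bucket, ComputeKey, AppendKeyIncrement, and entity members as in the repository code above; EnsureSuccess is the 2.x extension method mentioned in point 3):

```csharp
// Sketch of SaveAsync applying suggestions 3-5; assumes the same Bucket,
// ComputeKey, AppendKeyIncrement, and entity members (Id, Created, Updated)
// as in the repository code posted earlier.
public async Task<T> SaveAsync(T item)
{
    if (item.Id == null)
    {
        item.Created = DateTime.Now;
        item.Id = await AppendKeyIncrement().ConfigureAwait(false);
    }
    else if (!item.Created.HasValue)
        throw new InvalidOperationException("Existing entity must have a creation date");

    item.Updated = DateTime.Now;

    var key = ComputeKey(item.Id);

    var result = await Bucket.UpsertAsync(key, item).ConfigureAwait(false);

    // Point 3: EnsureSuccess throws an appropriate exception on failure,
    // even when result.Exception is null.
    result.EnsureSuccess();

    // Point 5: return the item we just wrote instead of re-reading it.
    return item;
}
```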

Could you send me the code for AppendKeyIncrement() and GetByIdAsync()? Perhaps the problem is there.

1 Like

Thanks for the suggestions, we’ll take care of them.
In the meantime, here are the two function bodies.

private async Task<string> AppendKeyIncrement()
{
    // const string counterKey = "global::counter";
    var counterKey = $"{GetTypeName()}::counter";
    IOperationResult<ulong> result = await Bucket.IncrementAsync(counterKey, 1, 1);

    return result.Value.ToString();
}

public async Task<T> GetByIdAsync(string id, bool throwExceptionIfNotExisting = true)
{
    var key = ComputeKey(id);

    IOperationResult<T> result = await Bucket.GetAsync<T>(key);

    if (!result.Success)
    {
        // deal with error
        return null;
    }

    if (throwExceptionIfNotExisting && result.Value == null)
        throw new ArgumentException("The requested document cannot be found");

    return result.Value;
}

public virtual string ComputeKey(string id)
{
    return string.Format("{0}::{1}", GetTypeName(), id);
}

Thanks,
Regards

1 Like

I see a couple more things, but again nothing major.

  1. You’re not checking for errors on the call to IncrementAsync.
  2. In the case of a document not being found, result.Success will be false, so you’ll never reach the logic around throwExceptionIfNotExisting. Instead of looking for null, check result.Status for KeyNotFound.
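Applying both points, the two methods could be sketched as follows (a sketch under the same assumptions as the posted code; ResponseStatus.KeyNotFound is the 2.x status enum, and EnsureSuccess is the extension method suggested earlier):

```csharp
// Sketch applying both points above; assumes the same Bucket, ComputeKey,
// and GetTypeName members as in the posted repository code.
private async Task<string> AppendKeyIncrement()
{
    var counterKey = $"{GetTypeName()}::counter";
    IOperationResult<ulong> result =
        await Bucket.IncrementAsync(counterKey, 1, 1).ConfigureAwait(false);

    // Point 1: surface errors instead of silently reading result.Value.
    result.EnsureSuccess();

    return result.Value.ToString();
}

public async Task<T> GetByIdAsync(string id, bool throwExceptionIfNotExisting = true)
{
    var key = ComputeKey(id);

    IOperationResult<T> result = await Bucket.GetAsync<T>(key).ConfigureAwait(false);

    // Point 2: a missing document reports Success == false with a
    // KeyNotFound status, so test the status explicitly.
    if (result.Status == ResponseStatus.KeyNotFound)
    {
        if (throwExceptionIfNotExisting)
            throw new ArgumentException("The requested document cannot be found");
        return null;
    }

    result.EnsureSuccess();
    return result.Value;
}
```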

How many of these operations are you trying to perform simultaneously (in parallel)?

1 Like

More or less, after 4 saves and 10 reads the system hits the timeout.

Yes, but how are you generating that load? Is it sequential, one after the other? Or are you trying to do 1000 operations in parallel?

Also, what is your cluster topology? A single Couchbase node?

The load is sequential. I’ve increased PoolConfiguration.MaxSize to 200 and my cycle from 10 to 100 iterations, and I’ve discovered that the system is now able to perform 25 steps instead of the previous 3. So I guess that the TCP sockets are allocated by a thread but not reused during the test life cycle. I can see this from TCPView:

Is there any PoolConfiguration setting, or anything else I can do, to let the sockets be reused once the data has been read/written? Also, will the same sockets be usable by another thread? For now I’m running a single-threaded test, but in a real environment there will be concurrent requests to serve.

Thanks a lot for your help,
Regards

Yeah, something about this still doesn’t make sense.

  1. A single connection is capable of processing multiple operations at once via multiplexing
  2. Every connection in the pool is meant to be reused

At our company alone we have dozens of different apps using this version of the SDK without problems like this, not to mention other users. So your issue must be either configuration-related or environment-related.

Can you create an application that reproduces this issue in a public GitHub repository, so I can view the entire application?

1 Like

May I share the project via WeTransfer, sending the link to your email?

I’ve created a test project that has few dependencies but generates the same error.

Thanks,
Regards

Sure, my email is bburnett at centeredgesoftware.com

@Dario_Mazza -

Can you try running the SDK doctor? https://docs.couchbase.com/server/current/sdk/sdk-doctor.html

It’s a very useful tool for diagnosing timeout issues.

-Jeff

Hi All,
I spent one more hour yesterday afternoon debugging every single operation, and I’ve found the root cause (as you’ve reported in another post, “the timeout is just a symptom”).

The problem is related to a bug in our code that generates a search key in the wrong format, something like this as meta().id:

componentLoad::11467#e2dNmmCWZe+UM7m0qCPrYLn7nPGu9oi3QDUD6/8oGDFsMDxoOyvAWBf3YqVN/vU6tCWZhHqproDTS8/F+U9OFEQAapyUrJvZXrF7iNAVBZw2S4sH/Ux8a/yM3e5zKUwxGqoDO52+pGrH4qNl84Rxir0c8yFBSTalFTd1TGkshIz8v0twRuBFqbTj+jRkWPSw76G5OcoJ4gjKGJm7jFIuMhhKbQoY3NBCpop2WgE7jNSGH74ZKbGGqF4/QrYGpt67ZFP3MZf1P1zlllA+70j3knVpfqkNGs4SQXDbMdyq1DfKP+Ogj4j6j75CM+QTbm6WR/JGWhVFzzGmuaUGF/sGFwumBOyfI4FT0Ox+KbnczB/aw+RX52bxY5XiVQ80fsP6Jr53QTg++4uuh9iQHSWbOGEkTdZL2fdGPjdBkzxlLW3zbkqc0RV2rKUgqiijgNNTx+fbQiRvkUGiQPlD7c9Oxdy5zl937+MggrP1MdxTUCt806prNj4q/MqniZCCGQ0vvh5ixKXCbAVjP2UV9SzINxgwobadrVGugvO1HSUt8yFFnIODQ47wHss6jCswBW9uzEr3nLn0hCnNxQ4YgRwl+3G6HmqAB2wvtAVibjnugqcok8/VeQRJtFmH8+SE1hQb

So, the invocation of IOperationResult result = bucket.Get(key); generates a result message like:
“KV Error: {Name=“EINVAL”, Description=“Invalid packet”, Attributes=“internal,invalid-input”}”

At this point it seems that all the already-used sockets were blocked. The unit test then went on until it saturated all the available sockets in the pool, at which point the timeout was issued.

Now we’ve fixed the bug, but I’m wondering whether there is a way to avoid a stale socket like that, maybe via configuration (e.g. closing the stale socket and opening a new one).

Thanks,
Regards

1 Like

@Dario_Mazza

That key is way over the maximum key length of 250 bytes, so that makes sense. However, it should definitely be handled more robustly. I’ve filed an issue to track this.
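For anyone hitting something similar, a defensive guard before KV operations could look like this (a sketch only: the 250-byte limit is the documented maximum Couchbase key size, but the KeyGuard helper name and shape are hypothetical, not part of the SDK):

```csharp
using System;
using System.Text;

// Hypothetical helper that rejects invalid document keys before they reach
// the SDK; 250 bytes is the documented maximum Couchbase key size.
public static class KeyGuard
{
    private const int MaxKeyBytes = 250;

    public static string Validate(string key)
    {
        if (string.IsNullOrEmpty(key))
            throw new ArgumentException("Document key must be non-empty", nameof(key));

        // The limit applies to the UTF-8 encoded length, not the char count.
        var byteLength = Encoding.UTF8.GetByteCount(key);
        if (byteLength > MaxKeyBytes)
            throw new ArgumentException(
                $"Document key is {byteLength} bytes; the maximum is {MaxKeyBytes}",
                nameof(key));

        return key;
    }
}
```

Used as, for example, `await Bucket.UpsertAsync(KeyGuard.Validate(key), item)`, this turns a malformed key into an immediate, descriptive exception instead of an EINVAL on the wire.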

2 Likes

Ok, thanks a lot!

Dario