NuGet version 2.7.10: getting timeouts after 15 seconds, specific to one bucket

Hello,

We are using the Couchbase .NET client 2.7.10.
Couchbase has been running in our production environment for 2.5 years now.
Recently we started getting timeouts (after 15 seconds) when calling the GetAsync() method; the same is observed for BucketExists() as well.
Interestingly, this is only happening with one of our 6 buckets. Calls to all the other buckets work fine.
Both the app server and the Couchbase server are healthy, and there are no network bandwidth issues.
The app servers and the cluster are on the same LAN, and both are VMs.
We need quick help to find the reason for this behaviour, or a solution if someone has come across this issue.
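
For reference, here is a minimal sketch of the call pattern in question, assuming the standard SDK 2.x cluster/bucket setup; the host, credentials, bucket name and key below are placeholders:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Couchbase;
    using Couchbase.Authentication;
    using Couchbase.Configuration.Client;

    class TimeoutRepro
    {
        static async Task Main()
        {
            // Placeholder bootstrap node, credentials, bucket and key.
            var cluster = new Cluster(new ClientConfiguration
            {
                Servers = new List<Uri> { new Uri("http://couchbase-host:8091") }
            });
            cluster.Authenticate(new PasswordAuthenticator("app-user", "app-password"));

            var bucket = cluster.OpenBucket("my-bucket");

            // This call intermittently takes ~15 seconds instead of a few milliseconds.
            var result = await bucket.GetAsync<byte[]>("some-document-key");
            Console.WriteLine($"Success: {result.Success}, Status: {result.Status}, Message: {result.Message}");

            cluster.Dispose();
        }
    }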

– Kalyan

Hello @Kalyan_Dixit, welcome to the group. @jmorris, can you please assist?

@Kalyan_Dixit -

A couple of things to point out:

  • Timeouts are always a side effect; you’ll want to determine if the timeout is happening on the app server, on the network or on the CB server. Logging can help identify where the timeout is occurring.

  • There have been a dozen releases since 2.7.10; I would update to the latest, 2.7.22, and see whether you hit a bug that has since been fixed.

-Jeff

Thanks for the reply Jeff.

  • We have logging in place, and I think the issue is related to the .NET client we are using to communicate with Couchbase, as the other buckets work perfectly fine when we face this issue.

  • Secondly, yes, we plan to update to 2.7.22 today. I will post here if the issue persists.

–Kalyan

Hi Jeff,

We did update to 2.7.22, but GetAsync for a single document is still intermittently returning results after 15 seconds. If I restart my application, a get for the same key takes less than 100 ms, but after some time it starts getting stuck.
Any possible reasons for such behavior?

–Kalyan

Hi Jeff,

Any update on this?

–Kalyan

As I stated, timeouts are nearly always a side-effect, so the logs can help to determine what is causing the timeout. Post the logs or send them to me directly.

A couple more questions:

  • What is the load you are running?

  • What is the document size?

  • TBH, 15 s is a very long timeout duration; the default is 2.5 s, and if you're timing out at 15 s, something is wrong and it's time to investigate (see the configuration sketch below).
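
For reference, a minimal sketch of where those timeouts are configured in the 2.x client (the host is a placeholder; DefaultOperationLifespan is the 2.5 s operation lifespan, and the pool's SendTimeout, which defaults to 15000 ms, appears to line up with the 15 s in your timeout messages):

    using System;
    using System.Collections.Generic;
    using Couchbase.Configuration.Client;

    static class TimeoutSettings
    {
        public static ClientConfiguration Build()
        {
            return new ClientConfiguration
            {
                Servers = new List<Uri> { new Uri("http://couchbase-host:8091") },

                // Per-operation lifespan in milliseconds (SDK 2.x default: 2500 ms).
                DefaultOperationLifespan = 2500,

                PoolConfiguration = new PoolConfiguration
                {
                    // Socket-level send timeout in milliseconds (default 15000 ms);
                    // this is the value that surfaces in SendTimeoutExpiredException.
                    SendTimeout = 15000
                }
            };
        }
    }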

-Jeff

Thanks for the reply, Jeff. Below are the answers to your questions:

  • Currently we are hitting it just once a minute for debugging purposes, since it is having the issue (we still get timeouts even at that low frequency), but the expected load when live in PROD could rise to 500+ requests per second.

  • We are storing compressed data as bytes; the documents we are testing with are approximately 1 MB in size.

  • When it runs well, the response time is less than 100 ms. Also, the other buckets, which hold 1 million+ items and see a good number of fetches/sec in PROD, are not having this issue.

We are using a .NET Core API, and our wrapper DLL that references this NuGet package targets .NET Standard 2.0.
Is there any specific kind of logging you want us to try?

–Kalyan

One possibility is it’s related to the buffers where we build operations. We cache and reuse these buffers (up to 32 of them) to reduce allocations on the LOH. However, to ensure we don’t use too much RAM we don’t reuse buffers greater than 1MB in size. We dispose of those. My original thought was that we might make these thresholds tunable via configuration if there was a need, but it isn’t currently tunable.

Since you’re doing approx 1MB operations, some of them may be exceeding that threshold and causing the LOH to thrash. I can’t say for sure, it’s just a loose theory, but maybe worth checking. Unfortunately, there isn’t any logging around disposing of these buffers because they are global and outside the scope of logging.

Possible tests:

  1. Make all operations less than 1 MB (not exactly 1 MB; we need 30-ish bytes for headers). A rough size-check sketch follows this list.
  2. Adjust the thresholds in the linked file above and build from source
  3. Use a memory tracing tool like JetBrains dotMemory to see how much this is happening and if garbage collection might be the culprit.
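
As a rough sketch of test (1), assuming GZip stands in for whatever compression you actually use, a pre-flight size check could look like this:

    using System.IO;
    using System.IO.Compression;

    static class PayloadSizeCheck
    {
        // ~1 MB pooled-buffer threshold discussed above, minus headroom for the
        // ~30-byte operation header and the document key.
        private const int BufferThreshold = 1024 * 1024;
        private const int HeaderAllowance = 64;

        // Returns true if the compressed payload should still fit a reusable buffer.
        public static bool FitsPooledBuffer(byte[] rawDocument)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
                {
                    gzip.Write(rawDocument, 0, rawDocument.Length);
                }
                return output.Length <= BufferThreshold - HeaderAllowance;
            }
        }
    }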

Actually, scratch all of that. I missed that you were referring to the 2.7 SDK; this construct was added in 3.0.

@Kalyan_Dixit -

You need to follow the directions in the link I sent you to enable SDK client-side logging; pick whichever logging provider you want to use. This will show us what is happening internally within the SDK and help us isolate what is causing the issue.
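
For example, a minimal sketch on .NET Core, assuming the LogManager.ConfigureLoggingFactory hook from the 2.x logging docs and the Serilog file provider (any Microsoft.Extensions.Logging provider should work; see the linked docs for the exact steps for your SDK version):

    using Couchbase.Logging;                 // LogManager in SDK 2.x
    using Microsoft.Extensions.Logging;

    static class SdkLogging
    {
        // Call once at startup, before the cluster/bucket is opened.
        public static void Enable()
        {
            var loggerFactory = new LoggerFactory();

            // AddFile comes from the Serilog.Extensions.Logging.File package.
            loggerFactory.AddFile("Logs/couchbase-{Date}.log", LogLevel.Debug);

            // Hand the factory to the SDK so its internal TRACE/DEBUG messages are written out.
            LogManager.ConfigureLoggingFactory(loggerFactory);
        }
    }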

This makes me think something on the network (a load balancer, network appliance, etc.) is treating this as an idle connection and terminating it.

-Jeff

Thanks for the reply @btburnett3.
We actually plan to upgrade to 3.1, but we wanted to fix this now because of a time crunch.

Sure @jmorris, we will try this logging. Just to highlight: we had a BucketExists check before the GetAsync call, and surprisingly it was also intermittently taking 15 seconds to respond.
Also, all of this is happening for one bucket only. The other buckets use the same code to communicate with Couchbase but have never had any issue (mentioning this since you suspected a network device, load balancer or other appliance is causing it).

@Kalyan_Dixit -

It might be the Couchbase server and the bucket itself, the network, or the SDK; however, without the information requested it's impossible to isolate the issue.

Hi @jmorris,

I have captured the logs as suggested, but I am unable to upload the file; I get the message "Sorry, new users can not upload attachments."
Please let me know how I can upload a zip file with the detailed logs.

Meanwhile, below are a few log snippets.
Note: I have masked the IP addresses and the Couchbase username.
Bucket name: "PROD_SupplierInvCache"

1. Client log showing the error message:

     2020/12/10 12:46:39.125|DEBUG|Creating bucket refCount# 81 |Couchbase.CouchbaseBucket|
      2020/12/10 12:46:39.125|DEBUG|Bootstraping was already done, returning existing bucket PROD_SupplierInvCache |Couchbase.Core.ClusterController|
      2020/12/10 12:46:39.125|TRACE|Getting KeyMapper for rev#3909 on thread 14 |Couchbase.Configuration.ConfigContextBase|
      2020/12/10 12:46:39.125|TRACE|Using index 286 for key 912515121110219416 - rev3909 |Couchbase.Core.Buckets.VBucketKeyMapper|
      2020/12/10 12:46:39.125|TRACE|1. Checking authentication [True|True]: y.y.y.5:11210 - 48295cd4-9c88-40d6-bc20-7aab0506689a |Couchbase.IO.ConnectionPoolBase|
      2020/12/10 12:46:39.125|TRACE|Using conn 48295cd4-9c88-40d6-bc20-7aab0506689a on y.y.y.5:11210 |Couchbase.IO.Services.SharedPooledIOService|
      2020/12/10 12:46:39.125|TRACE|Using conn 48295cd4-9c88-40d6-bc20-7aab0506689a on y.y.y.5:11210 |Couchbase.IO.Services.PooledIOService|
      2020/12/10 12:46:39.125|INFO|Checking if node y.y.y.5:11210 should be down - last: 12:43:18.7468262, current: 12:46:39.1358450, count: 1 |Couchbase.Core.Server|
      2020/12/10 12:46:54.084|INFO|The operation has timed out. ["s":"kv","i":"f4","c":"399968f7f4600607/fd725527bae15dee","b":"PROD_XmlRateIdStore","l":"x.x.x.201:54670","r":"y.y.y.7:11210","t":15000000] |Couchbase.IO.ConnectionBase|
      2020/12/10 12:46:54.084|WARN|Handling disconnect for connection b1d0ef07-ae19-4169-b989-e0becca426df: Couchbase.IO.SendTimeoutExpiredException: The operation has timed out. ["s":"kv","i":"f4","c":"399968f7f4600607/fd725527bae15dee","b":"PROD_XmlRateIdStore","l":"x.x.x.201:54670","r":"y.y.y.7:11210","t":15000000]
         at System.Environment.get_StackTrace()
         at Couchbase.IO.SendTimeoutExpiredException..ctor(String message)
         at Couchbase.IO.ConnectionBase.CreateTimeoutException(UInt32 opaque)
         at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
         at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)
         at Couchbase.Core.Server.Send[T](IOperation`1 operation)
         at Couchbase.Configuration.Server.Monitoring.ConfigMonitor.<StartMonitoring>b__12_0()
         at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
         at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
         at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
         at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
         at System.Threading.Tasks.Task.TrySetResult()
         at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
         at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
         at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
         at System.Threading.TimerQueue.FireNextTimers() |Couchbase.IO.ConnectionBase|
      2020/12/10 12:46:54.093|INFO|Checking if node y.y.y.7:11210 should be down - last: 12:43:24.5495488, current: 12:46:54.0936121, count: 2 |Couchbase.Core.Server|
      2020/12/10 12:46:54.093|DEBUG|Waiting to check configs... |Couchbase.Configuration.Server.Monitoring.ConfigMonitor|
      2020/12/10 12:46:54.140|INFO|The operation has timed out. ["s":"kv","i":"f7","c":"399968f7f4600607/77f49d592e30611b","b":"PROD_SupplierInvCache","l":"x.x.x.201:54647","r":"y.y.y.5:11210","t":15000000] |Couchbase.IO.ConnectionBase|

2. GetAsync() method response when timed out (this is a custom log from our app):
{
  "StatusCode": null,
  "ClassName": "CouchCacheHelper",
  "MethodName": "Timeout Result : ",
  "HostName": "EON-SALENDRIY",
  "ApplicationName": "CoreCache-MG.Jarvis.Core.Cache",
  "Message": {
    "Value": null,
    "Success": false,
    "Message": "The operation has timed out. [\"s\": \"kv\", \"i\": \"f3\", \"c\": \"399968f7f4600607/2cba4800923db510\", \"b\": \"PROD_SupplierInvCache\", \"l\": \"y.y.y.y:54651\", \"r\": \"x.x.x.x:11210\", \"t\": 15000000]",
    "Token": { "VBucketId": -1, "VBucketUUID": -1, "SequenceNumber": -1, "BucketRef": null },
    "Cas": 0,
    "Status": 512,
    "Durability": 0,
    "Exception": {
      "StackTrace": "at System.Environment.get_StackTrace()
                     at Couchbase.IO.SendTimeoutExpiredException..ctor(String message)
                     at Couchbase.IO.ConnectionBase.CreateTimeoutException(UInt32 opaque)
                     at Couchbase.IO.MultiplexingConnection.<>c__DisplayClass7_0.b__0(Object o)
                     at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
                     at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
                     at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
                     at System.Threading.TimerQueue.FireNextTimers()",
      "Message": "The operation has timed out. [\"s\": \"kv\", \"i\": \"f3\", \"c\": \"399968f7f4600607/2cba4800923db510\", \"b\": \"PROD_SupplierInvCache\", \"l\": \"y.y.y.y:54651\", \"r\": \"x.x.x.x:11210\", \"t\": 15000000]",
      "Data": {
        "ServiceType": "kv",
        "OperationId": "f3",
        "Bucket": "PROD_SupplierInvCache",
        "ConnectionId": "399968f7f4600607/2cba4800923db510",
        "LocalEndpoint": "y.y.y.y:54651",
        "RemoteEndpoint": "x.x.x.x:11210",
        "Timeout": 15000000
      },
      "InnerException": null,
      "HelpLink": null,
      "Source": "Couchbase.NetClient",
      "HResult": -2146233083
    },
    "Id": "912515121110219403",
    "OpCode": 0
  },
  "InputParameters": []
}

I hope this is what you wanted and that it helps with the analysis.

–Kalyan

That is a "hard" timeout: the client is waiting for a response from the server, and that response never arrives. I would start investigating the network and/or the server; most likely, as said earlier, it's a load balancer or something else closing idle connections.
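
If it does turn out to be an intermediary dropping idle connections, one possible mitigation sketch, assuming the 2.x ClientConfiguration keep-alive settings, is to tighten TCP keep-alives so sockets are probed before the idle timeout fires (values below are illustrative, not recommendations):

    using System;
    using System.Collections.Generic;
    using Couchbase.Configuration.Client;

    static class KeepAliveSettings
    {
        public static ClientConfiguration Build()
        {
            return new ClientConfiguration
            {
                Servers = new List<Uri> { new Uri("http://couchbase-host:8091") },
                EnableTcpKeepAlives = true,   // enabled by default in 2.x
                TcpKeepAliveTime = 60000,     // ms of idle time before the first keep-alive probe
                TcpKeepAliveInterval = 1000   // ms between probes when no acknowledgement is received
            };
        }
    }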

-Jeff