@alexc - that timeout appears to be the client waiting on a response from the server. Since no other errors precede it, I suspect perhaps something network/environment related.
-Jeff
We are receiving exactly the same error intermittently on our servers.
2018-01-16 20:56:20.588|STP SmartThreadPool Thread #7667|ERROR|Cruise.Cache.Base.Cache.CouchbaseCache|ExecuteUpsert|Upsert: False The operation has timed out.|key 83399d8b4ed55cb07fd22bee8b576b8ed46e7571|exception Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
   at Couchbase.IO.Services.MultiplexingIOService.Execute[T](IOperation`1 operation)
2018-01-16 20:56:20.588|STP SmartThreadPool Thread #7667|FATAL|Cruise.Cache.Base.Cache.CouchbaseCache|ExecuteUpsert|cache timeout and stopped working with error The operation has timed out. duration 15.0089136|83399d8b4ed55cb07fd22bee8b576b8ed46e7571
   at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at Couchbase.IO.SendTimeoutExpiredException..ctor(String message)
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
   at Couchbase.IO.Services.MultiplexingIOService.Execute[T](IOperation`1 operation)
   at Couchbase.Core.Server.Send[T](IOperation`1 operation)
   at Couchbase.Core.Buckets.MemcachedRequestExecuter.SendWithRetry[T](IOperation`1 operation)
   at Couchbase.MemcachedBucket.Upsert[T](String key, T value, TimeSpan expiration, TimeSpan timeout)
   at Couchbase.MemcachedBucket.Upsert[T](String key, T value, TimeSpan expiration)
   at Cruise.Cache.Base.Cache.CouchbaseCache.ExecuteUpsert[TEntity](String ckey, TEntity valueToSave, TimeSpan timeToLive)
   at Amib.Threading.Internal.WorkItemsGroupBase.<>c__DisplayClass43_0`3.b__0(Object state) in M:\Dev\STP\STP.git\SmartThreadPool\WorkItemsGroupBase.cs:line 337
   at Amib.Threading.Internal.WorkItem.ExecuteWorkItem() in M:\Dev\STP\STP.git\SmartThreadPool\WorkItem.cs:line 381
   at Amib.Threading.Internal.WorkItem.Execute() in M:\Dev\STP\STP.git\SmartThreadPool\WorkItem.cs:line 314
   at Amib.Threading.SmartThreadPool.ExecuteWorkItem(WorkItem workItem) in M:\Dev\STP\STP.git\SmartThreadPool\SmartThreadPool.cs:line 910
   at Amib.Threading.SmartThreadPool.ProcessQueuedItems() in M:\Dev\STP\STP.git\SmartThreadPool\SmartThreadPool.cs:line 850
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
After we start receiving that specific error, the client keeps throwing SendTimeoutExpiredException and stops working until we recycle the application pool.
The workaround we implemented is to catch the SendTimeoutExpiredException and then close and reopen the connections using ClusterHelper.Close and ClusterHelper.Initialize (a rough sketch is below).
I suspect there is a bug in the client: it doesn't reopen its connections when they are lost.
We are using client SDK 2.5.3.
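Roughly, the workaround looks like this. The helper name UpsertWithReset and the static configuration field are just for illustration; in our code the timeout is reported on the operation result rather than being thrown:

using System;
using Couchbase;
using Couchbase.Configuration.Client;
using Couchbase.IO;

public static class CouchbaseCacheWorkaround
{
    // In our application this configuration is loaded from app.config at startup.
    private static readonly ClientConfiguration Config = new ClientConfiguration();

    public static bool UpsertWithReset<T>(string bucketName, string key, T value, TimeSpan ttl)
    {
        var bucket = ClusterHelper.GetBucket(bucketName);
        var result = bucket.Upsert(key, value, ttl);

        if (!result.Success && result.Exception is SendTimeoutExpiredException)
        {
            // Force the client to drop and rebuild its connections, then retry once.
            ClusterHelper.Close();
            ClusterHelper.Initialize(Config);

            bucket = ClusterHelper.GetBucket(bucketName);
            result = bucket.Upsert(key, value, ttl);
        }

        return result.Success;
    }
}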
A SendTimeoutExpiredException happens when the client times out waiting for a response from the server. Generally some kind of network or transport failure causes the response to “hang”; often it's a TCP RST. When the SDK generates this exception, it checks the state of the connection and creates a new one if it is no longer connected. You should not have to recycle the app pool or close and reopen the client by calling ClusterHelper.Close/Initialize; the SDK should repair itself.
Since this is just the exception generated when the request times out, it doesn't contain enough information to determine why it timed out or what the internal state of the SDK is. To get that information you'll need to enable logging on the SDK and grep through the logs (you can send me a DM with them and I'll take a look as well).
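If it helps, this is roughly what the app.config looks like when wiring the 2.x SDK's logging through Common.Logging to log4net. Note that the adapter type and assembly names below are from memory and may differ depending on the Common.Logging and log4net versions you have installed:

<configSections>
  <sectionGroup name="common">
    <section name="logging" type="Common.Logging.ConfigurationSectionHandler, Common.Logging" />
  </sectionGroup>
  <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net" />
</configSections>
<common>
  <logging>
    <!-- Assembly name varies with the log4net version in use (e.g. Common.Logging.Log4Net1213) -->
    <factoryAdapter type="Common.Logging.Log4Net.Log4NetLoggerFactoryAdapter, Common.Logging.Log4Net1213">
      <arg key="configType" value="INLINE" />
    </factoryAdapter>
  </logging>
</common>
<log4net>
  <appender name="FileAppender" type="log4net.Appender.FileAppender">
    <file value="C:\temp\couchbase-sdk.log" />
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%date [%thread] %level %logger - %message%newline" />
    </layout>
  </appender>
  <root>
    <level value="DEBUG" />
    <appender-ref ref="FileAppender" />
  </root>
</log4net>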
Another useful tool is sdk-doctor, which helps diagnose network issues between your app and the cluster.
-Jeff
Logging is enabled on our servers at WARN level, but no other exceptions have been thrown.
When the client starts throwing SendTimeoutExpiredException, it keeps throwing it and never reopens the connections. The only workaround we found was to force-close and re-initialize the cluster.
I tried running the sdk-doctor utility on one of our three servers and received some alerts:
sdk-doctor diagnose http://az-cah-couch001.crociere.lan:8091/pricing -u Administrator -p xxxxxx
09:59:42.464 INFO ? Parsing connection string http://az-cah-couch001.crociere.lan:8091/pricing
09:59:42.501 WARN ? Connection string is using the deprecated http:// scheme. Use the couchbase:// scheme instead!
09:59:42.502 INFO ? Connection string identifies the following CCCP endpoints:
09:59:42.502 INFO ? Connection string identifies the following HTTP endpoints:
09:59:42.502 INFO ?   1. az-cah-couch001.crociere.lan:8091
09:59:42.503 INFO ? Connection string specifies bucket pricing
09:59:42.503 WARN ? Your connection string specifies only a single host. You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
09:59:42.504 INFO ? Performing DNS lookup for host az-cah-couch001.crociere.lan
09:59:42.507 INFO ? Bootstrap host az-cah-couch001.crociere.lan refers to a server with the address 10.70.4.21
09:59:42.507 INFO ? Not attempting CCCP, as the connection string does not support it
09:59:42.508 INFO ? Attempting to connect to cluster via HTTP (Terse)
09:59:42.508 INFO ? Attempting to fetch terse config via http from az-cah-couch001.crociere.lan:8091
09:59:42.524 WARN ? Bootstrap host az-cah-couch001.crociere.lan is not using the canonical node hostname of 10.70.4.21. This is not neccessarily an error, but has been known to result in strange and difficult-to-diagnose errors in the future when routing gets changed.
09:59:42.525 INFO ? Identified the following nodes:
09:59:42.525 INFO ?   [0] 10.70.4.21
09:59:42.526 INFO ?     mgmt: 8091, indexStreamMaint: 9105, capi: 8092
09:59:42.526 INFO ?     moxi: 11211, indexAdmin: 9100, indexScan: 9101
09:59:42.526 INFO ?     indexHttp: 9102, indexStreamInit: 9103, indexStreamCatchup: 9104
09:59:42.532 INFO ?     projector: 9999, kv: 11210, n1ql: 8093
09:59:42.534 INFO ?   [1] 10.70.4.22
09:59:42.534 INFO ?     mgmt: 8091, indexAdmin: 9100, indexHttp: 9102
09:59:42.534 INFO ?     indexStreamCatchup: 9104, capi: 8092, projector: 9999
09:59:42.535 INFO ?     n1ql: 8093, indexScan: 9101, indexStreamInit: 9103
09:59:42.535 INFO ?     indexStreamMaint: 9105, kv: 11210, moxi: 11211
09:59:42.536 INFO ?   [2] 10.70.4.23
09:59:42.536 INFO ?     mgmt: 8091, indexScan: 9101, indexStreamCatchup: 9104
09:59:42.536 INFO ?     capi: 8092, moxi: 11211, n1ql: 8093
09:59:42.542 INFO ?     indexAdmin: 9100, indexHttp: 9102, indexStreamInit: 9103
09:59:42.543 INFO ?     indexStreamMaint: 9105, projector: 9999, kv: 11210
09:59:42.544 WARN ? Your configuration was fetched via a non-optimal path, you should update your connection string and/or cluster configuration to allow CCCP config fetch
09:59:42.544 INFO ? Fetching config from 10.70.4.21:8091
09:59:42.547 INFO ? Failed to retreive cluster information (status code: 401)
09:59:42.549 ERRO ? Failed to connect to KV service at 10.70.4.21:11210 (error: invalid bucket name/password)
09:59:42.571 INFO ? Successfully connected to MGMT service at 10.70.4.21:8091
09:59:42.581 INFO ? Successfully connected to CAPI service at 10.70.4.21:8092
09:59:42.583 INFO ? Successfully connected to N1QL service at 10.70.4.21:8093
09:59:42.591 ERRO ? Failed to connect to KV service at 10.70.4.22:11210 (error: invalid bucket name/password)
09:59:42.595 INFO ? Successfully connected to MGMT service at 10.70.4.22:8091
09:59:42.605 INFO ? Successfully connected to CAPI service at 10.70.4.22:8092
09:59:42.623 INFO ? Successfully connected to N1QL service at 10.70.4.22:8093
09:59:42.631 ERRO ? Failed to connect to KV service at 10.70.4.23:11210 (error: invalid bucket name/password)
09:59:42.641 INFO ? Successfully connected to MGMT service at 10.70.4.23:8091
09:59:42.647 INFO ? Successfully connected to CAPI service at 10.70.4.23:8092
09:59:42.650 INFO ? Successfully connected to N1QL service at 10.70.4.23:8093
09:59:42.652 WARN ? Failed to perform KV connection performance analysis on 10.70.4.21:11210 (error: %!d(string=invalid bucket name/password))
09:59:42.654 WARN ? Failed to perform KV connection performance analysis on 10.70.4.22:11210 (error: %!d(string=invalid bucket name/password))
09:59:42.657 WARN ? Failed to perform KV connection performance analysis on 10.70.4.23:11210 (error: %!d(string=invalid bucket name/password))
09:59:42.658 INFO ? Diagnostics completed

Summary:
[WARN] Connection string is using the deprecated http:// scheme. Use the couchbase:// scheme instead!
[WARN] Your connection string specifies only a single host. You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
[WARN] Bootstrap host az-cah-couch001.crociere.lan is not using the canonical node hostname of 10.70.4.21. This is not neccessarily an error, but has been known to result in strange and difficult-to-diagnose errors in the future when routing gets changed.
[WARN] Your configuration was fetched via a non-optimal path, you should update your connection string and/or cluster configuration to allow CCCP config fetch
[WARN] Failed to perform KV connection performance analysis on 10.70.4.21:11210 (error: %!d(string=invalid bucket name/password))
[WARN] Failed to perform KV connection performance analysis on 10.70.4.22:11210 (error: %!d(string=invalid bucket name/password))
[WARN] Failed to perform KV connection performance analysis on 10.70.4.23:11210 (error: %!d(string=invalid bucket name/password))
[ERRO] Failed to connect to KV service at 10.70.4.21:11210 (error: invalid bucket name/password)
[ERRO] Failed to connect to KV service at 10.70.4.22:11210 (error: invalid bucket name/password)
[ERRO] Failed to connect to KV service at 10.70.4.23:11210 (error: invalid bucket name/password)

Found multiple issues, see listing above.
If I try connecting via couchbase://, I receive an unexpected error:
sdk-doctor diagnose couchbase://az-cah-couch001.crociere.lan:8091/pricing -u Administrator -p xxxxxx
10:01:02.536 INFO ? Parsing connection string couchbase://az-cah-couch001.crociere.lan:8091/pricing
10:01:02.586 INFO ? Connection string identifies the following CCCP endpoints:
10:01:02.586 INFO ?   1. az-cah-couch001.crociere.lan:8091
10:01:02.586 INFO ? Connection string identifies the following HTTP endpoints:
10:01:02.586 INFO ? Connection string specifies bucket pricing
10:01:02.587 WARN ? Your connection string specifies only a single host. You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
10:01:02.587 INFO ? Performing DNS lookup for host az-cah-couch001.crociere.lan
10:01:02.591 INFO ? Bootstrap host az-cah-couch001.crociere.lan refers to a server with the address 10.70.4.21
10:01:02.592 INFO ? Attempting to connect to cluster via CCCP
10:01:02.592 INFO ? Attempting to fetch config via cccp from az-cah-couch001.crociere.lan:8091
10:06:02.599 ERRO ? Failed to fetch configuration via cccp from az-cah-couch001.crociere.lan:8091 (error: EOF)
10:06:02.600 INFO ? Not attempting HTTP (Terse), as the connection string does not support it
10:06:02.600 INFO ? Not attempting HTTP (Full), as the connection string does not support it
10:06:02.601 ERRO ? All endpoints specified by your connection string were unreachable, further cluster diagnostics are not possible
10:06:02.601 INFO ? Diagnostics completed

Summary:
[WARN] Your connection string specifies only a single host. You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
[ERRO] Failed to fetch configuration via cccp from az-cah-couch001.crociere.lan:8091 (error: EOF)
[ERRO] All endpoints specified by your connection string were unreachable, further cluster diagnostics are not possible

Found multiple issues, see listing above.
I am asking for logging to be enabled within the SDK, not your application logs; see the SDK logging documentation: Couchbase Server | Couchbase Docs
Is your bucket name and password correct in your configuration?
I suspect networking issues (possible DNS issue?); your node is not reachable. You’ll need to diagnose and resolve these issues. The timeouts are a side-effect of something not being correct between the application and the cluster.
Thanks for your response.
I know the problem may be network-related (perhaps a DNS issue); I am investigating that.
What I don't understand is why I have to catch the timeout exception and force a close and re-initialize of the ClusterHelper. This is something the client should handle internally.
If I don't “restart” the connections on the Couchbase client, it keeps throwing SendTimeoutExpiredException forever, for every operation. The connections should be reopened automatically by the client.
Do you have any advice, or could you investigate this?
That is interesting, because the SDK does indeed handle the creation and destruction of connections; this is how the client supports use-cases such as failover, node swaps and rebalances.
One thing to note is that when you call Initialize you are effectively re-bootstrapping the SDK instead of relying on the inner mechanisms which manage connections. The cluster tells the client via a cluster map (config) update whether it has changed; however, if the cluster doesn't send a new revision, the cached IPEndPoints, IPAddresses, etc. may be reused. That could be your problem, but I'm not sure; grepping the SDK logs would help here.
Additionally, if the client detects too many failures coming from a node, it goes into a back-off state where it won't send requests until it determines that it can connect to the node and receive NOOPs. In this state it immediately fails each request and returns a NodeUnavailableException until it can successfully connect and receive NOOPs again.
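In the meantime, one thing you can do is log the details of each failed operation result so they can be correlated with the SDK logs. A rough sketch (the helper name is just for illustration):

using System;
using Couchbase;

public static class ResultDiagnostics
{
    // Log enough detail from a failed result to line it up with the SDK log:
    // status, message and the underlying exception type (e.g. SendTimeoutExpiredException,
    // or NodeUnavailableException when the node is in its back-off state).
    public static void LogFailure<T>(string key, IOperationResult<T> result)
    {
        if (result.Success) return;

        Console.WriteLine("Key={0} Status={1} Message={2} Exception={3}",
            key,
            result.Status,
            result.Message,
            result.Exception == null ? "<none>" : result.Exception.GetType().Name);
    }
}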
-Jeff
Hi All,
I am not sure if you have sorted out this issue?
We are having a similar issue to what @alexc described at the beginning: the client complains that the operation has timed out, and unless we restart the ClusterHelper, all subsequent requests keep timing out.
The .NET client version is 2.5.5, and we use ClusterHelper (described here: https://developer.couchbase.com/documentation/server/4.0/sdks/dotnet-2.2/cluster-helper.html) in our implementation.
Our setup is as follows:
<couchbaseClients>
<couchbase useSsl="false">
<servers>
<add uri="http://192.168.10.75:8091/pools"/>
</servers>
<buckets>
<add name="myBucket">
<connectionPool maxSize="10" minSize="5"/>
</add>
</buckets>
</couchbase>
</couchbaseClients>
We found the timeout issue happens after the client has been “idle” for a while - when we stop sending requests to the server for a while and then send again, the timeout occurs.
Having read @jmorris 's suggestion to enable the SDK log, we turned it on. After examining the log, we found the timeouts happen after the connection that was previously used gets disconnected.
Here is part of the log:
[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase
Handling disconnect for connection b4abdb4d-1763-4aba-9e2e-1c531175ce2f: Couchbase.IO.RemoteHostClosedException: The remote host (10.38.110.11:11210) has gracefully closed this connection.
[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase
Closing connection b4abdb4d-1763-4aba-9e2e-1c531175ce2f
[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase
Handling disconnect for connection 1090cb62-ac02-4c1b-ac6b-8f408b9f03a8: Couchbase.IO.RemoteHostClosedException: The remote host (10.38.110.11:11210) has gracefully closed this connection.
[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase
Closing connection 1090cb62-ac02-4c1b-ac6b-8f408b9f03a8
...... Check config and processing config ......
[2018-03-16 13:04:33,549] Couchbase.Core.Server
Sending Get`1 with key PARBJS180510180526-100-JWZ-1002-Y using server 10.38.110.11:11210
[2018-03-16 13:04:33,549] Couchbase.IO.ConnectionBase
Handling disconnect for connection b4abdb4d-1763-4aba-9e2e-1c531175ce2f: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode)
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
[2018-03-16 13:04:33,549] Couchbase.Core.Server
Sending Get`1 with key PARBJS180510180526-100-JWZ-2007-Y using server 10.38.110.11:11210
[2018-03-16 13:04:33,586] Couchbase.IO.ConnectionBase
Handling disconnect for connection 1090cb62-ac02-4c1b-ac6b-8f408b9f03a8: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode)
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
...... Check config and processing config ......
[2018-03-16 13:04:48,588] Couchbase.IO.Services.PooledIOService
Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.
at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)
[2018-03-16 13:04:48,588] Couchbase.IO.Services.PooledIOService
Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.
at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
at Couchbase.IO.Services.IOServiceBase.Execute[T](IOperation`1 operation, IConnection connection)
at Couchbase.IO.Services.IOServiceBase.EnableServerFeatures(IConnection connection)
at Couchbase.IO.Services.IOServiceBase.CheckEnabledServerFeatures(IConnection connection)
at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)
[2018-03-16 13:04:48,588] Couchbase.Core.Server
Checking if node 10.38.110.11:11210 should be down - last: 12:59:15.2323782, current: 13:04:48.5880404, count: 1
[2018-03-16 13:04:48,588] Couchbase.Core.Server
Checking if node 10.38.110.11:11210 should be down - last: 12:59:15.2323782, current: 13:04:48.5880404, count: 2
[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.CouchbaseRequestExecuter
Operation doesn't support retries for key PARBJS180510180526-100-JWZ-2007-Y
[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.RequestExecuterBase
Operation for key PARBJS180510180526-100-JWZ-2007-Y failed after 1 retries using vb898 from rev69 and opaque326. Reason: The operation has timed out.
[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.CouchbaseRequestExecuter
Operation doesn't support retries for key PARBJS180510180526-100-JWZ-1002-Y
[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.RequestExecuterBase
Operation for key PARBJS180510180526-100-JWZ-1002-Y failed after 1 retries using vb989 from rev69 and opaque325. Reason: The operation has timed out.
Hope this helps to solve the problem.
Regards,
Xi
Do you have TCP keep-alive set, either programmatically and/or on the server?
https://developer.couchbase.com/documentation/server/current/sdk/dotnet/client-settings.html
We had a few issues with timeouts when the firewall/F5 started closing connections it deemed inactive; having a 60s keep-alive ping fixed that.
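Something like this in the couchbase config section you posted earlier (the attribute names are as I remember them from the client-settings page above, so double-check them against your SDK version; times are in milliseconds):

<couchbaseClients>
  <couchbase useSsl="false" enableTcpKeepAlives="true" tcpKeepAliveTime="60000" tcpKeepAliveInterval="1000">
    <servers>
      <add uri="http://192.168.10.75:8091/pools"/>
    </servers>
    <buckets>
      <add name="myBucket">
        <connectionPool maxSize="10" minSize="5"/>
      </add>
    </buckets>
  </couchbase>
</couchbaseClients>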
@clinton1ql Thank you for the tip.
We do not enable any keep-alive setting explicitly. However, I found that keep-alive is enabled by default and set to 2 hours, according to the documentation.
The issue we are facing is that after the first request is sent, we wait 5 more minutes and send a second one. The second one is very likely to end with an operation timeout. With SDK logging enabled, we waited to send the second request until after the connection was closed (we were watching the log during the test), and that guarantees an operation timeout.
This is the part we want to ask about. Normally, if a connection is dropped because it was idle, I think that's perfectly fine: we just re-establish the connection and send the later requests. However, our client behaves a little abnormally - unless I reset it, it returns a timeout for every request I try to send afterwards.
@xiduan -
Your connection is definitely being closed, probably by something between the app and the server (firewall, LB, etc.): The remote host (10.38.110.11:11210) has gracefully closed this connection.
In this case the SDK will re-establish a connection and, for operations that allow retries, retry the operation on the new connection. Normally the application never knows this happened and everything just works as expected. However, since the operation type is a mutation, retries are not allowed at the SDK level, because the mutation may have succeeded on the server and the client simply never received the response.
In this case the application should retry the operation immediately, if that is the behavior you want. The second request will likely succeed, assuming that whatever closed the connection initially hasn't closed it again.
Hopefully, that helps!
-Jeff
Hi Jeff,
Thank you very much for pointing out the issue.
I do not quite understand this part:
What we do is request a bucket from the ClusterHelper and issue a generic Get operation, like this:
IBucket statusBucket = ClusterHelper.GetBucket("status");
var statusGetResult = statusBucket.Get<CacheStatus>(key);
Could you please shed some light on how I can make it retry at the SDK level?
Regards,
Xi
@xiduan -
I was referring to this log output:
You should only see that when using some sort of mutation without CAS.
I can't see enough of the log to correlate the exact operation that is failing. A GET should be retried until it times out. However, the initial GET might spend its entire lifetime waiting for the connection response; when that times out, the GET times out too and is not retried. The next operation should succeed in that case, assuming the connection wasn't closed again. You can send me a more complete log in a private message if you wish and I'll take a look.
This would simply be a retry loop in the application where you check the response and decide whether to retry based on the status of the operation response.
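Something along these lines - a rough sketch, with an arbitrary attempt count and delay rather than any recommended values:

using System;
using System.Threading;
using Couchbase;
using Couchbase.Core;

public static class RetryHelper
{
    // Retry a Get a few times at the application level, checking the operation
    // result between attempts. Tune maxAttempts and the delay for your workload.
    public static IOperationResult<T> GetWithRetry<T>(IBucket bucket, string key, int maxAttempts = 3)
    {
        IOperationResult<T> result = null;
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            result = bucket.Get<T>(key);
            if (result.Success) return result;

            // Back off briefly before trying again; the SDK may have re-established
            // the connection in the meantime.
            Thread.Sleep(TimeSpan.FromMilliseconds(100 * attempt));
        }
        return result;
    }
}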
-Jeff
Hi @jmorris
According to our observation, the Get operation hangs for about 15 seconds before it returns.
We also log the operation result when it is not successful - we can see the failure result appears 15 seconds after the request was issued (during those 15 seconds there were some config checks according to the SDK log), and we do not do any retry in between.
I am eager to send you the full log, but I can't find a place to draft a private message. As a new member, I am not able to send an attachment either.
Could you please advise how I can send the private message, or whether it is possible to send it to you by email?
Regards,
Xi
20180317-debug.zip (73.2 KB)
Hi @jmorris ,
I have managed to upload my SDK log here.
The problem is that once disconnected, the client takes a long time (about 15s) to produce a timed-out response, and it doesn't recover by itself: it gives the same timed-out response if I send another request after the first one. Moreover, in another test I tried passing a timeout parameter to the Get operation, but it was not respected - it still took about 15s to return the timed-out message.
Both the client and the server are on the same LAN, and I disabled the firewall on the server while doing the test.
The server version is 4.5.0 and the client version is 2.5.5.
Many Thanks,
Xi
@xiduan -
The PoolConfiguration.SendTimeout property controls how long the client will wait for a response from the server before raising SendTimeoutExpiredException. There is an example of the configuration on this page.
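In the XML configuration you posted earlier, that would look roughly like this (the value is in milliseconds; I believe the attribute is sendTimeout, but verify it against the docs page for your version):

<buckets>
  <add name="myBucket">
    <connectionPool maxSize="10" minSize="5" sendTimeout="30000"/>
  </add>
</buckets>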
The real problem is that something between the client and the server is closing the connections. If you always have to time out and then recreate the connections, you'll never get the performance you expect. I strongly suggest you look deeper into what is causing the connections to close in your environment.
-Jeff
Hi @jmorris
Thank you very much for the explanation - we will dig deeper into what sits between the client and the server.
As you mentioned, if we have constant requests there shouldn't be any problem. I believe so as well, because the issue is caused by dropped connections.
If the SDK is going to re-establish the connection, I assume it will take a very short time to create the new connection for local nodes, am I right?
Is there a typical time for such a connection rebuild?
Regards,
Xi
Looking at your log files, it is exactly 5 minutes between when the connection is started and when it is disconnected.
(e.g. connection 58e60e3a-4c0d-4114-b8af-ea95e5c89519)
[DEBUG][2018-03-17 14:45:19,933] Couchbase.Authentication.SASL.ScramShaMechanism [8]
Client First Message 10.38.110.11:11210 - 58e60e3a-4c0d-4114-b8af-ea95e5c89519: n,n=cachedata,r=CfPttq8ZOx5epGtpC1E18o8nucvR [U:cachedata|P:
[WARN][2018-03-17 14:50:19,938] Couchbase.IO.ConnectionBase [49]
Handling disconnect for connection 58e60e3a-4c0d-4114-b8af-ea95e5c89519: Couchbase.IO.RemoteHostClosedException: The remote host (10.38.110.11:11210) has gracefully closed this connection.
If you have a multi-node Couchbase cluster, I would suggest looking at the log files on those nodes for cluster comms; you might find that node-to-node connections are also being dropped.
As a workaround to keep the connections active, you could set the TCP keep-alive to 60 seconds.
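If you prefer to set it programmatically rather than in app.config, roughly like this (property names as per the client-settings docs; values in milliseconds - treat this as a sketch and check them for your SDK version):

using Couchbase;
using Couchbase.Configuration.Client;

public static class KeepAliveBootstrap
{
    public static void Init()
    {
        var config = new ClientConfiguration
        {
            EnableTcpKeepAlives = true,
            TcpKeepAliveTime = 60000,    // idle time before the first keep-alive probe
            TcpKeepAliveInterval = 1000  // interval between probes once they start
        };
        ClusterHelper.Initialize(config);
    }
}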
Currently we are using a single node in the development environment.
We found the cause of the cut-off of the connections that should have stayed alive: the Apache default keep-alive timeout setting.
Changing the Apache keep-alive timeout setting seems to solve the problem.
Thank you very much for the help!
Regards,
Xi