Couchbase failed to insert due to kvTimeout

Hello Couchbase Support Team,
We are encountering an issue where insert requests to the cache are failing with a timeout error during high-load situations. The error message is as follows:

"Failed when inserting into cache: InsertRequest, Reason: TIMEOUT {"cancelled":true,"completed":true,"coreId":"0x6dc425ca00000002","idempotent":false,"lastDispatchedTo":"...prod...","reason":"TIMEOUT","requestId":...,"requestType":"InsertRequest","retried":14,"retryReasons":["ENDPOINT_NOT_WRITABLE"],"service":{"bucket":"bucket-disableFlag","collection":"_default","documentId":"1.0.0-disableFlag-id-id","opaque":"0x4e216a0","scope":"_default","type":"kv","vbucket":45},"timeoutMs":2500,"timings":{"encodingMicros":11,"totalMicros":2510048}}"

We have found that increasing the “kvTimeout” setting to 10000 resolves the issue. However, we would like to explore alternative solutions that do not require increasing the timeout value.
Could you please advise on the recommended approach, or on alternative ways to improve the situation?
Additionally, we have a question regarding the “retried”:14 parameters in the error message. Is this parameter configurable and can we reduce the number of retries? Also, are there any negative consequences of having a high number of retries?

Look at earlier messages from the SDK (logger com.couchbase.core, at info or perhaps debug level) to find out why there were ENDPOINT_NOT_WRITABLE conditions. There may also be some clues in the server logs. A node may be (temporarily) rejecting connections due to heavy load. Adding more nodes may improve performance. These retries don’t have much of a downside: after a short delay, the client just checks the endpoint, sees that it is (still) not available, and schedules another retry.

As improvements are continually being made to the SDKs, it’s beneficial to use the latest version.

logback.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d %5p %40.40c:%4L - %m%n</pattern>
        </encoder>
    </appender>

    <root level="warn">
        <appender-ref ref="console"/>
    </root>

    <!-- Log Couchbase core SDK events at info (switch to debug for more detail) -->
    <logger name="com.couchbase.core" level="info"/>
</configuration>

ENDPOINT_NOT_WRITABLE has a couple of main causes:

  1. The Endpoint (TCP connection, essentially) isn’t connected currently. The SDK logging should show if this is the case.
  2. The network library we use (Netty) is reporting that it is not immediately ready to write to the connection. Since you report it’s happening under high load, this is most likely.

It can be challenging to debug exactly where the bottleneck is. I’d start with checking resource usage on both the cluster and application side, and GC logs on the application side.

You could try adding more KV connections from the SDK to the cluster, using numKvConnections (Client Settings | Couchbase Docs).
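As a rough illustration of the numKvConnections suggestion, here is a minimal sketch that assumes the Java SDK 3.x (the thread does not say which JVM SDK or version is in use). The connection string, credentials, and the value 4 are hypothetical placeholders, and newer SDK releases may prefer a different builder style, so check the Client Settings page for your version.

```java
import com.couchbase.client.core.env.IoConfig;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.ClusterOptions;
import com.couchbase.client.java.env.ClusterEnvironment;

public class KvConnectionsExample {
    public static void main(String[] args) {
        // Hypothetical placeholders: replace with your own connection details.
        String connectionString = "couchbase://127.0.0.1";
        String username = "username";
        String password = "password";

        // Ask for 4 KV connections per node (an illustrative value, not a
        // recommendation; measure before and after changing it).
        ClusterEnvironment env = ClusterEnvironment.builder()
            .ioConfig(IoConfig.numKvConnections(4))
            .build();

        Cluster cluster = Cluster.connect(
            connectionString,
            ClusterOptions.clusterOptions(username, password).environment(env));

        // ... run the workload ...

        // An environment you create yourself must also be shut down by you.
        cluster.disconnect();
        env.shutdown();
    }
}
```

More connections only help if the bottleneck is the connection itself, so it is worth comparing timeout rates under load with a couple of different values.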

What Couchbase server version are you using? 7.0+ will make most efficient use of the connections.

Additionally, we have a question regarding the “retried”:14 parameters in the error message. Is this parameter configurable and can we reduce the number of retries? Also, are there any negative consequences of having a high number of retries?

You could use the FailFastRetryStrategy, but I wouldn’t recommend it. In any distributed system, some degree of retry is a necessity; networks are unreliable, and servers can be transiently overloaded.

The more retries, the better the availability for a single request, but the more load is added to the system. As with all things in distributed systems, it’s a tradeoff. The default BestEffortRetryStrategy uses an exponential backoff that aims for a middle ground.
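To make the tradeoff concrete, here is a small standalone sketch of capped exponential backoff. It is not the SDK's implementation: the class BackoffSketch, its methods, and the parameters (1 ms base, factor 2, 500 ms cap) are assumptions chosen for illustration, and it ignores the time each attempt itself takes. It only shows why the retry count is a by-product of the timeout rather than a separate setting.

```java
final class BackoffSketch {
    /** Delay in ms before retry number `attempt` (0-based): base * factor^attempt, capped. */
    static long delayMillis(int attempt, long baseMillis, double factor, long capMillis) {
        double raw = baseMillis * Math.pow(factor, attempt);
        return (long) Math.min(raw, (double) capMillis);
    }

    /** Counts how many retries can be scheduled before the summed delays exceed the timeout. */
    static int retriesWithin(long baseMillis, double factor, long capMillis, long timeoutMillis) {
        long elapsed = 0;
        int retries = 0;
        while (true) {
            long next = delayMillis(retries, baseMillis, factor, capMillis);
            if (elapsed + next > timeoutMillis) {
                return retries;
            }
            elapsed += next;
            retries++;
        }
    }

    public static void main(String[] args) {
        // Assumed, illustrative parameters; compare a 2500 ms and a 10000 ms timeout.
        System.out.println(retriesWithin(1, 2.0, 500, 2_500));
        System.out.println(retriesWithin(1, 2.0, 500, 10_000));
    }
}
```

Because the delays grow quickly and then level off at the cap, a longer timeout adds comparatively few extra retries, each spaced far apart, which is why a high retry count under this kind of strategy adds little load.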
