OutOfMemory issue due to Couchbase

Hi all,
We are repeatedly seeing c.c.c.d.i.n.u.i.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 3204448263, max: 3221225472) errors after our application reaches 100% traffic.
The application runs in GCP with the following Java configuration:
ENV JAVA_OPTS="-Xms3g -Xmx3g -XX:+HeapDumpOnOutOfMemoryError"
ENV GC_OPTS=" -verbose:gc -XX:+PrintGCDetails -XX:+ParallelRefProcEnabled -XX:+PrintGCDateStamps -XX:+UseG1GC -XX:MaxGCPauseMillis=100"
The pod has 5 GB allocated to the application container and 2 GB to Istio.
We see frequent spikes in container memory up to 4 GB, at which point the container crashes, even though JVM heap usage stays below 2 GB. During those periods we also observe up to 2.5k Couchbase read operations per node (10 nodes in total). This happens very frequently.
Below is the Couchbase configuration:
/**
 * Create the cluster connection from the configured properties and
 * connect to the Couchbase host and bucket.
 */
@PostConstruct
public void setup() {
    CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
        .keepAliveTimeout(aliveTimeout)
        .connectTimeout(connectionTimeout)
        .socketConnectTimeout(socketTimeout)
        .queryTimeout(queryTimeout)
        .build();
    try {
        couchbaseCluster = CouchbaseCluster.create(env, nodes);
        couchbaseCluster.authenticate(userName, password);
        asyncPromoBucket = couchbaseCluster.openBucket(bucketName).async();
    } catch (Exception e) {
        log.error("Failed trying to connect to the Couchbase cluster", e);
    }
}

/**
 * Disconnect from the Couchbase server and release resources during shutdown.
 */
@PreDestroy
public void preDestroy() {
    try {
        if (this.couchbaseCluster != null) {
            this.couchbaseCluster.disconnect();
        }
    } catch (Exception e) {
        log.error("Failed trying to disconnect from the Couchbase cluster", e);
    }
}
aliveTimeout: 10000000
socketTimeout: 30000
connectionTimeout: 50000
requestTimeout: 2000
poolSize: 250
queryTimeout: 1000
Here is the complete log entry with the exception details:
{"timestamp":"2020-07-07T19:06:58.850-04:00","logger_name":"com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline","severity":"WARN",
"message":"An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.",
"stack_hash":"e21e277a",
"stack_trace":"c.c.c.d.i.n.u.i.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 3204448263, max: 3221225472)
    at c.c.c.d.i.n.u.i.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:535)
    at c.c.c.d.i.n.u.i.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:489)
    at c.c.c.d.i.n.b.PoolArena$DirectArena.allocateDirect(PoolArena.java:766)
    at c.c.c.d.i.n.b.PoolArena$DirectArena.newChunk(PoolArena.java:742)
    at c.c.c.d.i.n.b.PoolArena.allocateNormal(PoolArena.java:244)
    at c.c.c.d.i.n.b.PoolArena.allocate(PoolArena.java:226)
    at c.c.c.d.i.n.b.PoolArena.allocate(PoolArena.java:146)
    at c.c.c.d.i.n.b.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:333)
    at c.c.c.d.i.n.b.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:183)
    at c.c.c.d.i.n.b.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:174)
    at c.c.c.d.i.n.b.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:135)
    at c.c.c.d.i.n.c.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
    at c.c.c.d.i.n.c.n.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
    at c.c.c.d.i.n.c.n.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
    at c.c.c.d.i.n.c.n.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
    at c.c.c.d.i.n.c.n.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    at c.c.c.d.i.n.c.n.NioEventLoop.run(NioEventLoop.java:460)
    at c.c.c.d.i.n.u.c.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at c.c.c.d.i.n.u.c.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)",
"trace":"","span":"","parent":"","class":"c.c.c.d.i.n.c.DefaultChannelPipeline","service":"PromotionsExecutionService"}

The method that fetches data from Couchbase:
final AsyncBucket asyncPromoBucket = cbConfiguration.getAsyncPromoBucket();
if (asyncPromoBucket == null) {
    log.error("Error in Couchbase connection");
    throw new PromoExecutionException("Error in Couchbase connection");
}

long t1 = System.nanoTime();
Observable<ItemPromo> itemPromo = asyncPromoBucket.get(docId)
    .retryWhen(RetryBuilder
        .anyOf(BackpressureException.class, RequestCancelledException.class,
            TimeoutException.class)
        .delay(Delay.fixed(delay, TimeUnit.MILLISECONDS)).max(maxAttempt).build())
    .doOnError(e -> {
      log.error("Error while retrieving data from CB", e);
      throw new PromoExecutionException("Error while retrieving data from CB...");
    })
    .map(RepositoryHelper.parseItemPromoJson).switchIfEmpty(Observable.just(new ItemPromo()))
    .timeout(timeout, TimeUnit.MILLISECONDS, Schedulers.io());
long t2 = System.nanoTime();

return itemPromo;
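
For completeness, callers consume this lazily; nothing is fetched until the Observable is subscribed to (so t1/t2 above only time the pipeline construction, not the Couchbase call). A hypothetical blocking call site would look like:

// Hypothetical call site (method name assumed): the get/retry/map pipeline
// only executes once the returned Observable is subscribed to.
ItemPromo promo = fetchItemPromo(docId)
    .toBlocking()
    .single();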

Need immediate help.
Thanks in advance

failed to allocate 16777216 byte(s) of direct memory (used: 3204448263, max: 3221225472)

It fails to allocate a 16 MB buffer (direct memory is capped at 3 GB). With poolSize = 250, that is 250 x 16 MB = 4 GB. I'm thinking that with a poolSize of 100, it would be 100 x 16 MB = 1.6 GB; even at the Couchbase maximum document size of 20 MB, that would max out at 2.0 GB.

Thanks for your quick reply. Although we define poolSize in the YAML, we are not actually using it in the Couchbase configuration, which means that with a 4-core CPU the SDK will create buffers based on 4 x 16 MB (or whatever its default calculation is). Another thing is that we have not configured maxDirectMemory, so I believe it defaults based on the -Xmx heap size.
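That would also explain why the cap in the error, 3221225472 bytes (3 GB), lines up with our -Xmx3g setting. For reference, if we did want the SDK to use the poolSize value, my understanding (an untested sketch, and assuming poolSize is meant to drive the SDK's I/O event-loop pool) is that it would be passed to the environment builder like this:

CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
    // Assumption: poolSize is intended for the SDK's I/O pool; today it is
    // read from the YAML but never applied in setup().
    .ioPoolSize(poolSize)
    .keepAliveTimeout(aliveTimeout)
    .connectTimeout(connectionTimeout)
    .socketConnectTimeout(socketTimeout)
    .queryTimeout(queryTimeout)
    .build();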

You could use some of the techniques described here for troubleshooting OutOfDirectMemoryError: https://github.com/reactor/reactor-netty/issues/800
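For example, one of those techniques is to watch Netty's own direct-memory counters while the problem builds. Here is a rough sketch against the Netty shaded inside the SDK (the c.c.c.d.i.n package prefix in your stack trace); whether usedDirectMemory() and the allocator metric are available, and whether the SDK uses the default pooled allocator, depends on your shaded Netty version, so treat this as illustrative only:

import com.couchbase.client.deps.io.netty.buffer.PooledByteBufAllocator;
import com.couchbase.client.deps.io.netty.util.internal.PlatformDependent;

// Illustrative sketch: log the shaded Netty direct-memory counters periodically
// (e.g. from a scheduled task) and correlate them with the container memory spikes.
public final class DirectMemoryLogger {

    public static void logDirectMemory() {
        long used = PlatformDependent.usedDirectMemory();   // bytes Netty has reserved
        long max = PlatformDependent.maxDirectMemory();     // effective cap (defaults to about -Xmx)
        long pooled = PooledByteBufAllocator.DEFAULT.metric().usedDirectMemory();

        System.out.printf("direct memory: used=%d, max=%d, pooledDirect=%d%n",
                used, max, pooled);
    }
}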
It would be useful to identify which version of the Couchbase SDK you are using (so the Netty version can be identified).
It would also be useful to determine whether this is a transient problem (it disappears when load decreases), which would point to load, or a permanent one (once it starts it keeps happening regardless of load), which would point to a memory leak.
"During those periods we also observe up to 2.5k Couchbase read operations per node (10 nodes in total). This happens very frequently."
Is that 2.5k reads per second? Times 10 nodes, that would be 25,000 reads per second, all from the same client JVM? That's 40 microseconds per request (micro, not milli), which is pretty much the limit I've seen from a four-core client with the Couchbase 3.0 SDK. At that point I would expect requests (and responses) to be queued up and holding memory. Perhaps the solution is to add more client resources.
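
If adding client resources isn't immediately possible, one client-side mitigation (my sketch, not an SDK feature) is to bound how many gets are in flight at once, so queued responses can't accumulate direct-memory buffers without limit, e.g. with RxJava's flatMap concurrency parameter:

// Sketch: cap concurrent KV gets from this client. docIds and maxInFlight are
// illustrative; asyncPromoBucket and parseItemPromoJson are from the snippet above.
int maxInFlight = 64;

Observable<ItemPromo> promos = Observable
    .from(docIds)
    .flatMap(id -> asyncPromoBucket.get(id), maxInFlight)
    .map(RepositoryHelper.parseItemPromoJson);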