Java SDK memory leak with a bad cluster

We found a memory leak in Java SDK 2.1.6. It only happens under specific conditions, namely when the cluster itself is having problems.
Preconditions:

  1. We created cluster_1 with 3 nodes.
  2. The 3rd node has a problem with HDD capacity - the disk is full.
  3. On the client we occasionally get the error com.couchbase.client.java.error.CouchbaseOutOfMemoryException, but there is no memory leak at this point.
  4. We have a Couchbase client (Java SDK) which periodically reconnects to the Couchbase cluster.

Conditions:
  5. We start XDCR replication from an external Couchbase cluster_2 to our problem cluster_1.
  6. On Couchbase cluster_1 we see messages like these:

[14:38:25] - Approaching full disk warning. Usage of disk “F:” on node “172.20.112.93” is around 91%.
[14:38:25] - Approaching full disk warning. Usage of disk “F:” on node “srv3” is around 100%.
[14:38:25] - Hard Out Of Memory Error. Bucket “ufm_2” on node srv3 is full. All memory allocated to this bucket is used for metadata.

  7. And now we can see that the following classes leak instances:
    com.couchbase.client.deps.io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
    com.couchbase.client.core.ResponseEvent
    com.couchbase.client.core.RequestEvent

Example code that models the problem:

package com.ufm.api;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
import org.junit.Test;

public class MemoryLeakTest {

    @Test
    public void testCBMemoryLeak() throws Exception {
        // Endlessly open and close a connection to the problem cluster;
        // heap usage grows over time while the cluster is in the bad state.
        while (true) {
            ConnectionContainer connectionContainer = getConnectionContainer();
            Thread.sleep(200);
            closeConnection(connectionContainer);
            Thread.sleep(200);
        }
    }

    private void closeConnection(ConnectionContainer connectionContainer) {
        connectionContainer.getBucket().close();
        connectionContainer.getCouchbaseCluster().disconnect();
        // Note: the CouchbaseEnvironment created for this connection is not shut down here.
    }

    private ConnectionContainer getConnectionContainer() {
        ConnectionContainer connectionContainer = new ConnectionContainer();

        // A fresh environment is built for every connect/disconnect cycle.
        CouchbaseEnvironment couchbaseEnvironment = DefaultCouchbaseEnvironment.builder()
                .kvTimeout(5000L)
                .build();
        connectionContainer.setCouchbaseEnvironment(couchbaseEnvironment);

        CouchbaseCluster cluster = CouchbaseCluster.create(couchbaseEnvironment, "http://srv3:8091");
        connectionContainer.setCouchbaseCluster(cluster);

        Bucket bucket = cluster.openBucket("ufm_2", "1111");
        connectionContainer.setBucket(bucket);
        return connectionContainer;
    }

    // Simple holder for the environment, cluster and bucket of one connection.
    private class ConnectionContainer {
        private CouchbaseEnvironment couchbaseEnvironment;
        private CouchbaseCluster couchbaseCluster;
        private Bucket bucket;

        public CouchbaseEnvironment getCouchbaseEnvironment() {
            return couchbaseEnvironment;
        }

        public void setCouchbaseEnvironment(CouchbaseEnvironment couchbaseEnvironment) {
            this.couchbaseEnvironment = couchbaseEnvironment;
        }

        public CouchbaseCluster getCouchbaseCluster() {
            return couchbaseCluster;
        }

        public void setCouchbaseCluster(CouchbaseCluster couchbaseCluster) {
            this.couchbaseCluster = couchbaseCluster;
        }

        public Bucket getBucket() {
            return bucket;
        }

        public void setBucket(Bucket bucket) {
            this.bucket = bucket;
        }
    }
}

Hi @dpozhidaev, sorry for the late answer…
This looks similar to something reported in a Netty bug: io.netty.channel.ChannelOutboundBuffer$Entry weird behaviour · Issue #4134 · netty/netty · GitHub

There is a possible workaround, but it would require you to upgrade the SDK from 2.1.6 to 2.2.7 (the latest in the current series), because it requires a newer version of Netty.

The workaround is to disable Netty's object pooling (the recycler) by starting the JVM with the following option:

-Dcom.couchbase.client.deps.io.netty.recycler.maxCapacity=0

This will, however, produce more GC pressure. The upstream Netty bug and its impact on the SDK are tracked in our own ticket, JCBC-951.
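If changing the JVM startup flags is inconvenient, the same property can in principle be set programmatically, but (as far as I know) the shaded Netty recycler reads it in a static initializer, so it has to be set before any Couchbase class is touched. A rough sketch - the class name is just an example, the -D flag remains the safer route:

// Must run before any com.couchbase.client.* class (and hence the shaded
// Netty Recycler) is loaded, otherwise the property is read too late.
public final class DisableNettyRecycler {
    public static void main(String[] args) throws Exception {
        System.setProperty("com.couchbase.client.deps.io.netty.recycler.maxCapacity", "0");
        // ... only now bootstrap the CouchbaseEnvironment / CouchbaseCluster ...
    }
}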

:warning: Keep in mind there have been a few behavioral changes in 2.2.0 (most notably, in the async API no request is triggered until you call subscribe(...) on the Observable). See the release notes for the 2.2.x series.
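For illustration, a rough sketch against a Bucket reference like the one in your test (the key is made up):

import rx.Observable;
import com.couchbase.client.java.document.JsonDocument;

// In 2.2.x the async API returns a cold Observable: nothing is sent to the
// server until subscribe(...) is called on it.
Observable<JsonDocument> pending = bucket.async().get("some-key"); // no request triggered yet
pending.subscribe(doc -> System.out.println(doc.content()));       // the get is sent here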

Thank you.
We moved to the newest SDK (2.2.7), but it didn't help. Thank you for the advice about maxCapacity - I will try it.

@dpozhidaev did setting the capacity help? I saw that there have been some upstream changes to the Netty recycler which we can pick up in later Java SDK releases once they ship…