Java SDK synchronization issue?
I have two services connecting to the same Couchbase cluster. One service (A) has two servers and performs approximately 40 reads/sec and 10 writes/sec. The other service (B) has three servers and makes around 500 reads/sec. Those figures are a mixture of single and bulk operations, so the peak usage on the cluster is around 4K gets/sec and 80 sets/sec.
The cluster is made up of 8 nodes each with 16GB of RAM (Couchbase gets 9.35GB of that). The nodes are all running 2.0.0 and the Java SDK version is 1.1.1 on all services.
Yesterday evening, following a small spike in timeouts reported from all three servers in Service B, two of the three servers from Service B began to fail when talking to one particular node in the Couchbase cluster.
Thread dumps from those two servers show many threads with the following stack trace:
pool-6-thread-1 id=68 state=TIMED_WAITING
    - waiting on <0x5fa4d3c6> (a java.util.concurrent.CountDownLatch$Sync)
    - locked <0x5fa4d3c6> (a java.util.concurrent.CountDownLatch$Sync)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
    at net.spy.memcached.internal.BulkGetFuture.internalGet(BulkGetFuture.java:140)
    at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:117)
    at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:49)
    at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1303)
    at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1342)
    at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1356)
    ... (our code begins from here)
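For what it's worth, the parked frames show BulkGetFuture waiting on a CountDownLatch. If, for whatever reason, the latch is never counted down (e.g. the response for an operation never arrives), the timed await returns false on timeout rather than throwing, so a caller that simply re-awaits can stay parked in TIMED_WAITING indefinitely. A minimal, self-contained sketch of that behaviour (plain java.util.concurrent, no Couchbase involved; the class name is mine, purely for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchHangDemo {
    public static void main(String[] args) throws InterruptedException {
        // A latch that is never counted down, standing in for the latch
        // inside BulkGetFuture when an operation's response never arrives.
        CountDownLatch latch = new CountDownLatch(1);

        // A timed await parks the thread (state TIMED_WAITING, exactly as
        // in the dump above) and returns false on timeout instead of
        // throwing. Code that loops on the result re-parks forever.
        boolean completed = latch.await(100, TimeUnit.MILLISECONDS);
        System.out.println("completed = " + completed); // prints "completed = false"
    }
}
```

That would be consistent with what we see: the threads are not deadlocked in the classic sense, they are repeatedly parking on a latch whose count never reaches zero. A second thread dump taken a few minutes later showing the same threads on the same latch address would support this.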
Service A and the remaining server from Service B are behaving normally.
Is anyone aware of what might have happened to cause this problem? Is it possible we have hit some synchronization hole in the Java SDK? I'm initially pointing my finger at the SDK because the other servers talking to the same cluster are behaving correctly, but I'm more than happy to look elsewhere if that's not the case.
The servers are still in this state at the moment (this is all happening in the background while we evaluate Couchbase under live load). Is there any more information I can provide to help diagnose the problem?