Java SDK synchronization issue?
I have two services connecting to the same Couchbase cluster. One service (A) has two servers and performs approximately 40 reads/sec and 10 writes/sec. The other service (B) has three servers and makes around 500 reads/sec. Those figures are a mixture of single and bulk operations so the peak usage of the cluster is around 4K gets/sec and 80 sets/sec.
The cluster is made up of 8 nodes each with 16GB of RAM (Couchbase gets 9.35GB of that). The nodes are all running 2.0.0 and the Java SDK version is 1.1.1 on all services.
Yesterday evening following a small spike in timeouts reported from all three servers in service B two of the three servers from service B began to fail when talking to one particular node in the Couchbase cluster.
Thread dumps from those two servers show a large number of the following stack trace:
pool-6-thread-1 id=68 state=TIMED_WAITING
- waiting on <0x5fa4d3c6> (a java.util.concurrent.CountDownLatch$Sync)
- locked <0x5fa4d3c6> (a java.util.concurrent.CountDownLatch$Sync)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
at net.spy.memcached.internal.BulkGetFuture.internalGet(BulkGetFuture.java:140)
at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:117)
at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:49)
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1303)
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1342)
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1356)
; Our code begins from hereService A and one server from Service B are displaying normal behaviour.
Is anyone aware of what might have happened to cause this problem? Is there a possibility that we might have fallen down some synchronization hold in the Java SDK. I'm initially pointing my finger at the SDK because of the other servers which are behaving correctly when talking to the same cluster but more than happy to look elsewhere if that's not the case.
The servers are still in this state at the moment (this is all happening in the background while we evaluate Couchbase under live load). Is there any more information which is required in order to help diagnose any problems?
The futures weren't completing in time and were logging timeouts using the default operation timeout against one particular node from the cluster (I think this is 2500ms in net.spy.memcached.DefaultConnectionFactory). We haven't altered this.
Unfortunately I wasn't able to see the Web-UI at the time of the incident but we scrape the bucket stats JSON and dump that into Graphite for each node in the cluster and get_hits doesn't show any drop.
Looking at our garbage collection stats it seems that the two servers which had problems were performing lots of full garbage collections following the problem. It's odd that the other server wasn't affected but I think I have some garbage investigation to do before pointing the finger at the SDK.
Thanks,
Neil
Hi Neilprosser,
The stack trace shows that lots of the bulk loading operations are waiting for their futures to be completed. I don't think this necessarily is a problem, but in the end - did the futures time out? Can you show us parts of the code that caused the issue?
Also, did you see useful information int he Web-UI graphs at the time of the problem? Did the get/s drop? At the end of the day, those futures need to time out or complete, what comes first. The future should provide more helpful information on what happened.
Thanks
Michael