Couchbase
  • Why NoSQL?
  • Couchbase Server
  • Download
  • Resources
  • Careers
Home | Forums | SDKs | SDKs

Java SDK synchronization issue?

2 replies [Last post]
  • Login or register to post comments
Wed, 02/06/2013 - 02:31
neilprosser
Offline
Joined: 10/23/2012
Groups: None

I have two services connecting to the same Couchbase cluster. One service (A) has two servers and performs approximately 40 reads/sec and 10 writes/sec. The other service (B) has three servers and makes around 500 reads/sec. Those figures are a mixture of single and bulk operations so the peak usage of the cluster is around 4K gets/sec and 80 sets/sec.

The cluster is made up of 8 nodes each with 16GB of RAM (Couchbase gets 9.35GB of that). The nodes are all running 2.0.0 and the Java SDK version is 1.1.1 on all services.

Yesterday evening following a small spike in timeouts reported from all three servers in service B two of the three servers from service B began to fail when talking to one particular node in the Couchbase cluster.

Thread dumps from those two servers show a large number of the following stack trace:

pool-6-thread-1 id=68 state=TIMED_WAITING
    - waiting on <0x5fa4d3c6> (a java.util.concurrent.CountDownLatch$Sync)
    - locked <0x5fa4d3c6> (a java.util.concurrent.CountDownLatch$Sync)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
    at net.spy.memcached.internal.BulkGetFuture.internalGet(BulkGetFuture.java:140)
    at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:117)
    at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:49)
    at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1303)
    at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1342)
    at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1356)
    ; Our code begins from here

Service A and one server from Service B are displaying normal behaviour.

Is anyone aware of what might have happened to cause this problem? Is there a possibility that we might have fallen down some synchronization hold in the Java SDK. I'm initially pointing my finger at the SDK because of the other servers which are behaving correctly when talking to the same cluster but more than happy to look elsewhere if that's not the case.

The servers are still in this state at the moment (this is all happening in the background while we evaluate Couchbase under live load). Is there any more information which is required in order to help diagnose any problems?

Top
  • Login or register to post comments
Thu, 02/07/2013 - 02:03
daschl
Offline
Joined: 05/19/2012
Groups: None

Hi Neilprosser,

The stack trace shows that lots of the bulk loading operations are waiting for their futures to be completed. I don't think this necessarily is a problem, but in the end - did the futures time out? Can you show us parts of the code that caused the issue?

Also, did you see useful information int he Web-UI graphs at the time of the problem? Did the get/s drop? At the end of the day, those futures need to time out or complete, what comes first. The future should provide more helpful information on what happened.

Thanks
Michael

Top
  • Login or register to post comments
Thu, 02/07/2013 - 06:41
neilprosser
Offline
Joined: 10/23/2012
Groups: None

The futures weren't completing in time and were logging timeouts using the default operation timeout against one particular node from the cluster (I think this is 2500ms in net.spy.memcached.DefaultConnectionFactory). We haven't altered this.

Unfortunately I wasn't able to see the Web-UI at the time of the incident but we scrape the bucket stats JSON and dump that into Graphite for each node in the cluster and get_hits doesn't show any drop.

Looking at our garbage collection stats it seems that the two servers which had problems were performing lots of full garbage collections following the problem. It's odd that the other server wasn't affected but I think I have some garbage investigation to do before pointing the finger at the SDK.

Thanks,
Neil

Top
  • Login or register to post comments
  • Login or register to post comments
  • Login
  • Register

Company

  • About Us
  • Leadership
  • Customers
  • Partners
  • Contact Us

Product

  • Couchbase Server
  • Couchbase SDKs
  • Use Cases
  • Documentation
  • Forums

Open Source

  • Couchbase Project
  • Couchbase vs. CouchDB

Commercial

  • Subscriptions & Support
  • Training & Services

News

  • Blog
  • Newsletter
  • Press Releases
  • Buzz

Follow Us

    
  • Customer Login
  • Terms of Service
  • Privacy Policy
  • Trademark Policy
  • Site Map

© 2013 COUCHBASE All rights reserved.

Sign in to Couchbase Community

close
  • Create new account
  • Request new password
You are logging into the Forums, Wiki and Issue Tracker