Avoid cascading failures
I'm running a tracker service based on the Play Framework (see http://www.playframework.com/) and use Couchbase to store the tracking data.
I have noticed some problems with cascading errors, where a single error can leave the system blocked for many hours, probably due to contention.
What typically happens is this: in the log I see a message like
2013-11-13 00:31:08.015 WARN com.couchbase.client.vbucket.ConfigurationProviderHTTP: Connection problems with URI http://production.couchbase.node.5:8091/pools ...skipping java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method)
meaning that a connection problem has occurred for one node. After some time I start to see messages like
2013-11-13 06:02:18.738 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Operation canceled because authentication or reconnection and authentication has taken more than one second to complete.
and the whole system becomes unresponsive, including request handling, even though the Couchbase nodes are running fine again.
As my track storage runs asynchronously, I believe one problem is that set/add operations just keep flowing into the system, so the buffers/queues keep growing because no operations complete. So basically a temporary problem persists because the client is overwhelmed by data.
Does the above conclusion sound valid? It currently looks like the Couchbase Java driver has a hard time recovering from a temporary issue because too many operations get queued in the system.
Has anybody experienced similar issues?
Are there any suggested methods/patterns to work around this? I'm currently considering putting a kind of queue in front of the Couchbase driver, so that we never have more than xxx requests in flight to the Couchbase nodes.
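A minimal sketch of what I have in mind, using a semaphore to cap the number of requests in flight. It does not touch the Couchbase API at all: the class name InFlightLimiter, the submit method and the CompletableFuture-based operation are all hypothetical names for illustration, and the real driver's futures would have to be adapted to this shape.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper, not part of any Couchbase library:
// caps the number of asynchronous operations in flight at once.
class InFlightLimiter {
    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Starts the operation only if a permit is free; otherwise fails fast
    // instead of queueing, so a stalled backend cannot grow the backlog.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> operation) {
        if (!permits.tryAcquire()) {
            CompletableFuture<T> rejected = new CompletableFuture<>();
            rejected.completeExceptionally(
                new IllegalStateException("too many requests in flight"));
            return rejected;
        }
        CompletableFuture<T> future;
        try {
            future = operation.get();
        } catch (RuntimeException e) {
            // The operation never started, so give the permit back.
            permits.release();
            throw e;
        }
        // Release the permit when the operation finishes, whether it
        // succeeded or failed.
        return future.whenComplete((result, error) -> permits.release());
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With this, a tracking write would go through something like limiter.submit(() -> storeTracking(data)), where storeTracking is again a hypothetical wrapper around the driver call. Note that an operation that never completes would hold its permit forever, so each operation would also need a timeout for this to help in the situation described above.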
I am not sure that this is related to contention. I cannot be 100% positive, but I feel that the issue is more likely related to inactivity on the network: the sockets are closed and then cannot be reopened.
Any thoughts on sockets possibly being dropped on your network?
Can you confirm that the system works as expected when there is activity?