SDK hangs all requests when the connection to only one Couchbase server has network delays

I’m testing Couchbase with a Java application, and during testing I’m observing the following behaviour:
One application server runs a long loop of gets for random documents at a constant rate of 2.5K requests per second.

While the test was running, I introduced a network delay of one second from the application server to one of the Couchbase servers. At that point I observed an even degradation of ALL gets to ALL Couchbase servers, not only to the delayed server.

Is this normal behaviour for the SDK implementation?

As a next step I’m going to try changing the retry policy to fail fast; I think it might improve things. Will update.
UPDATED: it didn’t help at all.

Just so you have more info about the test:

  • 3 Couchbase servers, each with 4 CPUs and 8 GB memory
  • the bucket I tested on is a 100 MB bucket
  • each doc is 10 KB
  • the app server has 8 CPUs and 8 GB memory
  • the JVM gets 4 GB
  • the SDK is started with one socket per server, 8 IO threads, 8 computation threads and a KV timeout of 2500 ms (the default; changing it to 500 ms or increasing it made no difference to the test). Adding or removing threads also made no difference.

will be happy for any input…

No answers here yet, but here is what I have found so far.
Adding some node-level circuit breaker logic in front of the Couchbase SDK did improve throughput to the other, non-impacted Couchbase nodes, but still not to the level I would expect. We are adding a better circuit breaker implementation now and hope it will improve things further.
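
A minimal sketch of what a node-level circuit breaker in front of the SDK could look like, in plain Java. This is an illustration only, not the implementation used in this test and not part of the Couchbase SDK: the class name `NodeCircuitBreaker`, the failure threshold and the open window are all hypothetical, and the caller is assumed to know which node a key maps to and to report each outcome back.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical per-node circuit breaker; not part of the Couchbase SDK. */
final class NodeCircuitBreaker {

    /** Mutable state for one node, guarded by synchronizing on the instance. */
    private static final class State {
        int consecutiveFailures;    // failures since the last success
        long openedAtMillis = -1;   // -1 means the circuit is closed
    }

    private final ConcurrentMap<String, State> states = new ConcurrentHashMap<>();
    private final int failureThreshold; // assumed tuning knob
    private final long openMillis;      // assumed tuning knob

    NodeCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Returns true if a request to this node should be attempted now. */
    boolean allowRequest(String node, long nowMillis) {
        State s = stateFor(node);
        synchronized (s) {
            if (s.openedAtMillis < 0) {
                return true; // closed: traffic flows normally
            }
            if (nowMillis - s.openedAtMillis >= openMillis) {
                // Half-open: let one probe through and restart the open window.
                s.openedAtMillis = nowMillis;
                return true;
            }
            return false; // open: fail fast without touching the slow node
        }
    }

    /** A successful response closes the circuit for this node. */
    void recordSuccess(String node) {
        State s = stateFor(node);
        synchronized (s) {
            s.consecutiveFailures = 0;
            s.openedAtMillis = -1;
        }
    }

    /** A timeout or error counts toward opening the circuit for this node only. */
    void recordFailure(String node, long nowMillis) {
        State s = stateFor(node);
        synchronized (s) {
            s.consecutiveFailures++;
            if (s.consecutiveFailures >= failureThreshold) {
                s.openedAtMillis = nowMillis;
            }
        }
    }

    private State stateFor(String node) {
        return states.computeIfAbsent(node, n -> new State());
    }
}
```

The idea is to call `allowRequest(node, System.currentTimeMillis())` before each get, fail the request immediately when it returns false, and call `recordSuccess` or `recordFailure` once the get completes or times out. State is kept per node, so a delayed node only trips its own circuit.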

For now, it looks like using Couchbase without a node-level circuit breaker doesn’t make sense at all. Combining a circuit breaker with SDK-based retries also doesn’t seem to make much sense; retry and fail logic needs to live at the app level only.

@oded are you seeing backpressure exceptions happening when you arbitrarily increase the network delay? I’d like to work with you through this specific thing you are seeing.

Also if you need a node-based circuit breaker there is one class which can help you build one under -> you can ask it for server infos for a given key and it will update itself based on the current config.

Finally, we are looking in general at strategies and internal circuit breakers, and in the Java client at switching away from a central ring buffer to individual per-node ones, but we are still in the planning stages.
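
To make the per-node buffer idea concrete, here is a rough sketch in plain Java. It is not how the SDK works internally and not the planned design; `PerNodeBuffers` and its capacity parameter are hypothetical names used only to show that with one bounded queue per node, a backlog for a slow node fills only that node's queue.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration: one bounded request buffer per node. */
final class PerNodeBuffers<T> {

    private final Map<String, BlockingQueue<T>> queues = new ConcurrentHashMap<>();
    private final int capacityPerNode; // assumed tuning knob

    PerNodeBuffers(int capacityPerNode) {
        this.capacityPerNode = capacityPerNode;
    }

    /** Non-blocking enqueue; false means only this node's buffer is full. */
    boolean offer(String node, T request) {
        return queueFor(node).offer(request);
    }

    /** Non-blocking dequeue for the node's I/O worker; null if nothing is pending. */
    T poll(String node) {
        return queueFor(node).poll();
    }

    /** Number of requests currently waiting for this node. */
    int pending(String node) {
        return queueFor(node).size();
    }

    private BlockingQueue<T> queueFor(String node) {
        return queues.computeIfAbsent(node, n -> new ArrayBlockingQueue<>(capacityPerNode));
    }
}
```

With a single shared buffer, requests waiting on the delayed node take up slots that requests for healthy nodes also need. Here a rejected `offer` for the slow node can be failed fast by the caller while the other nodes keep accepting work.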

Happy to hear the last sentence about a ring buffer per node, which makes sense. We were thinking of doing it ourselves, but there are different strategies to consider once there are separate ring buffers, and I’m pretty sure it won’t be enough on its own to resolve the issue.

About the nodelocator: we are using it for the circuit breaker. As I wrote before, it seems that it does help, but I now suspect that the nodelocator is slow in itself and that this is why we are not getting back to full throughput on the other nodes. This is only an assumption for now; it might be something in the circuit breaker implementation. Will check both assumptions tomorrow.

Will be happy to work with you on the issue we are experiencing.

We were able to improve the circuit breaker logic on our side, so throughput now jumps back to ~3K ops, but that is still far from 6K. The next suspect is the nodelocator, which is probably much slower than the rate of ops the test is able to request. That is still just an assumption; we need to keep checking…