I’ve been struggling for a while with a problem when using the Java SDK. I assumed it was a deadlock in my code, but I couldn’t understand how a simple “replace” could possibly deadlock with itself.
It turns out that a much earlier call had deadlocked one of the “worker” threads in the thread pool maintained by CoreScheduler, and this later call had been assigned to the same worker thread. Because that worker was stuck, the response to the replace request was never processed.
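For what it’s worth, the shape of the mistake was roughly the following (a simplified sketch rather than my actual code; it assumes the 2.x async API with RxJava 1.x, and the document IDs are made up):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;

public class DeadlockSketch {
    static void fetchBoth(Bucket bucket) {
        bucket.async()
              .get("parent-doc")
              .map(parent -> {
                  // map() is invoked on one of CoreScheduler's worker threads.
                  // If the response to the blocking get() below is assigned to
                  // that same worker, it can never be processed: the call blocks
                  // forever and the worker's queue just keeps growing.
                  JsonDocument child = bucket.get("child-doc");
                  return parent;
              })
              .subscribe();
    }
}
```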
Now that I’ve managed to unearth the problem, I can go ahead and fix the original deadlock, and presumably everything will then work. But I’m wondering whether these problems could be made easier to diagnose. To figure this one out, I had to (or at least I did) download the source to the Couchbase and RxJava libraries and add tracing to them until I could understand what was going wrong (in particular, that the queue for one of the thread pool workers kept growing). From there I was able to identify the thread at play and see what was going on.
Part of the problem is probably my failure to consistently add timeouts everywhere (my next step is to add such a timeout and check that it throws an exception instead of hanging).
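Concretely, something like this is what I mean by “adding a timeout” (a sketch against the 2.x blocking API; the two-second deadline is arbitrary):

```java
import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;

public class TimeoutSketch {
    static JsonDocument replaceWithTimeout(Bucket bucket, JsonDocument doc) {
        // The blocking API should wrap a TimeoutException in a
        // RuntimeException if the response doesn't arrive in time,
        // turning a silent hang into a visible failure.
        return bucket.replace(doc, 2, TimeUnit.SECONDS);
    }
}
```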
Would it be possible to add some kind of automatic diagnostic, run on a timer or otherwise, that reviews the status of the thread pools and raises an alert if one or more of them appears to be backed up? Or does such a facility already exist, and I just need to turn it on?
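If not, here is roughly the sort of thing I have in mind, sketched with plain JDK facilities. I can’t see the workers’ queue depths from the outside, so a stuck-thread check seems like a reasonable proxy; the “cb-computations” thread-name prefix is a guess on my part, so substitute whatever a real thread dump shows:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WorkerWatchdog {
    public static void start() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            for (ThreadInfo t : mx.dumpAllThreads(false, false)) {
                if (!t.getThreadName().startsWith("cb-computations")) continue;
                // An idle pool worker parks inside ThreadPoolExecutor.getTask();
                // a worker that is WAITING anywhere else (or BLOCKED) is
                // probably stuck in the middle of a task.
                boolean idle = Arrays.stream(t.getStackTrace()).anyMatch(f ->
                        f.getClassName().equals("java.util.concurrent.ThreadPoolExecutor")
                                && f.getMethodName().equals("getTask"));
                if (t.getThreadState() == Thread.State.BLOCKED
                        || (t.getThreadState() == Thread.State.WAITING && !idle)) {
                    System.err.println("Possibly stuck SDK worker: " + t);
                }
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}
```

Calling WorkerWatchdog.start() once at startup would have pointed me at the stuck thread in minutes rather than the hours of tracing it actually took.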
Finally, I’m unclear on why there appears to be a nested pool of pools: CoreScheduler seems to maintain a pool of ThreadPoolExecutors, each of which has just one thread. My code would still have been wrong either way, but if the workers had shared their threads, it would not have blocked in this way. Whether that alternative would be better or worse overall, I wouldn’t like to say.
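To make the point concrete, here is a standalone demonstration using plain java.util.concurrent (nothing Couchbase-specific): the same dependent-task pattern that deadlocks on a single-thread executor completes fine once the pool has a second thread, which is analogous to what happened to me.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleThreadDemo {
    public static void main(String[] args) throws Exception {
        // With Executors.newFixedThreadPool(2) this prints "inner result".
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> outer = pool.submit(() -> {
            // The inner task is queued behind this one on the only thread,
            // so inner.get() waits forever: a self-inflicted deadlock.
            Future<String> inner = pool.submit(() -> "inner result");
            return inner.get();
        });
        System.out.println(outer.get()); // hangs when the pool has one thread
        pool.shutdown();
    }
}
```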