Frequent node drops & wait_for_memcached_failed errors
We ran v1.7.x for many months, but recently our nine-node, Ubuntu-based cluster began sporadically dropping a few nodes. Usually one node would fail to see a heartbeat from another, causing that node to be marked pending. When this happened, we would have a LOT of trouble getting the cluster to rebalance, with frequent (but unhelpful) wait_for_memcached_failed errors in the log. Then, on one of the rebalance attempts, it would just work, with no configuration changes.
Yesterday morning we upgraded to v1.8.1, specifically for the fixes and changes to the service-restart and rebalancing functionality. Since then, the cluster seems MORE sensitive to latency among nodes, now dropping six or more of the nine nodes at a time, several times a day, and we are still having all the same rebalancing problems. I really don't know where to look next.
I have a diagnostic file, but, to be frank, what's inside is beyond my expertise with this system. This is a system we really just expect to work (and it DOES work in an identical configuration in other environments at our company). Is there any way to definitively identify WHY nodes are being dropped so often and why they are so hard to reintegrate and rebalance? I suspect we have network latency issues between them, but they are all in the same one or two racks, so it's not as if they are spread across a WAN. Is there any way to pinpoint the specific problem from the diagnostic file, and, if it really is latency, is there a way to adjust the heartbeat so the nodes are less sensitive?
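In the meantime, to test the latency theory myself, here is a minimal sketch of the probe I plan to run from each node toward the others. The hostnames are placeholders for our actual nodes, and 8091 is the Couchbase admin port (I assume any port the nodes actually listen on would work just as well):

```python
#!/usr/bin/env python
"""Rough inter-node latency probe: times TCP connects to each node."""
import socket
import time

NODES = ["cb-node1.example.com", "cb-node2.example.com"]  # placeholders
PORT = 8091          # Couchbase REST/admin port (placeholder choice)
SAMPLES = 20         # connection attempts per node
TIMEOUT_S = 5.0      # per-connect timeout

def connect_rtt(host, port):
    """Return the time taken to open (and close) one TCP connection."""
    start = time.time()
    sock = socket.create_connection((host, port), timeout=TIMEOUT_S)
    sock.close()
    return time.time() - start

for node in NODES:
    rtts = []
    for _ in range(SAMPLES):
        try:
            rtts.append(connect_rtt(node, PORT))
        except (socket.timeout, socket.error):
            rtts.append(None)  # record the failure instead of crashing
        time.sleep(0.5)
    ok = [r for r in rtts if r is not None]
    failed = len(rtts) - len(ok)
    if ok:
        print("%s: min %.1f ms, max %.1f ms, %d/%d failed"
              % (node, min(ok) * 1000, max(ok) * 1000, failed, SAMPLES))
    else:
        print("%s: all %d connection attempts failed" % (node, SAMPLES))
```

If the connect times come back steady at sub-millisecond levels with no failures, I assume I can rule the network out and look elsewhere. I have also read that the inter-node heartbeat in this system rides on Erlang's net_ticktime (60 seconds by default, settable as an Erlang kernel parameter), but I don't know whether changing it is safe or supported here, so I'd appreciate guidance on that too.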