Soft lock when spamming cluster
So we ran into some odd behavior yesterday that I haven't been able to find any mention of.
We had a 4-node cluster that we were spamming: one 400MB bucket with three replicas, two clients connected with two connections each, and every connection running a tight loop doing a set with a random key and value. This would work fine for a couple of minutes, but the clients would quickly reach a point where they got nothing but temporary out-of-memory errors from the server. That is expected if you push data to the server faster than it can drain its disk queue.
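For reference, each client loop looked roughly like the sketch below. This is a minimal simulation, not our actual code: `StubClient`, `TemporaryFailure`, and `drain` are all hypothetical stand-ins (the stub fakes a server whose disk-write queue fills up, and `drain` simulates the server catching up, which in real life happens on its own), so none of the names correspond to a real client API:

```python
import random
import string
import time


class TemporaryFailure(Exception):
    """Stand-in for the client's 'temporary out of memory' error."""


class StubClient:
    """Fake client: accepts writes until an in-memory 'queue' fills,
    then raises TemporaryFailure -- mimicking a server whose
    disk-write queue hasn't drained yet."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.store = {}
        self.pending = 0

    def set(self, key, value):
        if self.pending >= self.capacity:
            raise TemporaryFailure("temporary out of memory")
        self.store[key] = value
        self.pending += 1

    def drain(self, n=50):
        # Simulates the server flushing part of its disk queue.
        self.pending = max(0, self.pending - n)


def rand_str(n=16):
    return "".join(random.choice(string.ascii_letters) for _ in range(n))


def spam(client, writes=500, max_backoff=0.1):
    """Tight set loop with random keys/values, backing off on temp OOM."""
    done = 0
    while done < writes:
        backoff = 0.001
        while True:
            try:
                client.set(rand_str(), rand_str(64))
                done += 1
                break
            except TemporaryFailure:
                time.sleep(backoff)  # back off instead of hammering the node
                backoff = min(backoff * 2, max_backoff)
                client.drain()       # stand-in for the server draining itself
    return done
```

The point is just the shape of the workload: each connection generates sets as fast as it can, and the only pushback it ever sees from the server is the temporary OOM error.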
The problem is that we had apparently done the same thing to every single server in the cluster at the same time. At least from the outside, it looks as though the replication logic uses the same admission mechanism as the clients when determining whether a write can happen on a server, because each server was stuck in a loop of trying to replicate its data and failing.
Looking at the state of the TAP queues, we can see that the graphs for drain rate and back-off rate are in lock step with each other. No server can replicate to another, so nothing drains from memory, so nothing new can be added, and the bucket is basically rendered useless. We haven't restarted any of the servers yet, so we're not sure whether that would clear the problem.
It's possible this is entirely an artifact of our setup. The test cluster is made up of VMs running on the same physical machine; each VM was given access to all 8 cores and 4GB of memory.
Anyone have any thoughts?