Question about ep_tmp_oom_errors and failed writes
I am frequently running into issues "bulk" loading data into my membase cluster. I am fairly certain this behavior is mostly expected (from the wiki article "consequences of memory faster than disk"), so I'm cool with that. On my production hardware I can throw faster disks at the problem to mitigate the SERVER_ERROR / "temporary failure" exceptions that seem to get thrown when the disk write queue gets too large.
However, I have encountered a scenario that I took some screenshots of:
One screenshot is of the monitor GUI for the bucket; the other is output produced by my client. Both should be fairly self-explanatory.
Basically, the disk write queue is empty, and every time I attempt to "set" an object, ep_tmp_oom_errors increases by 1. My client tries to back off gracefully, retrying up to 10 times and waiting longer between each attempt, but if I get a server error / "temporary failure" 10 times in a row, I give up on the write.
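To make the retry behavior concrete, here is a minimal sketch of that kind of back-off. It is an illustration, not my actual client code: `TemporaryFailure`, `set_with_backoff`, `set_fn`, and the delay values are all hypothetical names and numbers, and `set_fn` stands in for whatever "set" call the client library provides.

```python
import time


class TemporaryFailure(Exception):
    """Hypothetical stand-in for the client's SERVER_ERROR / "temporary failure"."""


def set_with_backoff(set_fn, key, value, max_attempts=10,
                     base_delay=0.05, max_delay=2.0, sleep=time.sleep):
    """Call set_fn(key, value), retrying on TemporaryFailure with a growing wait.

    Returns the number of attempts used. Re-raises TemporaryFailure once
    max_attempts attempts have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            set_fn(key, value)
            return attempt
        except TemporaryFailure:
            if attempt == max_attempts:
                raise  # give up on this write
            # Wait longer after each failed attempt, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

With these assumed defaults, a write that keeps failing is attempted 10 times with 9 waits in between, then the error is passed up to the caller.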
I also managed to collect mb_collectinfo output from all 3 nodes while this was happening, if that is useful. I don't have a complete enough understanding of membase to know what's going on under the covers that causes this to happen.