All buckets block briefly on memcache bucket eviction
We're seeing an interesting pattern:
We have 5 buckets, 4 of which are memcached and one of which is membase (we're on 1.7.1.1), spread across 20 machines. Three of the memcached buckets and the membase bucket each operate at somewhere between 3,000 and 10,000 ops/second, and are filled to somewhere between 2 and 15 percent of several tens of GB across the cluster. No big deal.
The other memcached bucket, though, operates at 15-20K ops/sec and is 90% full with a cluster-wide 700GB.
Needless to say, we see a fair amount of eviction occurring on this big bucket.
The interesting part is that when an eviction occurs, ops/sec in _all other buckets_ drop to zero briefly -- cluster-wide. The stall lasts somewhere between half a second and a couple of seconds. Given the rate at which the other buckets are being queried, we see a slew of timeouts from every app server against every bucket.
I had figured that memcached eviction would be node-specific, so a block for "garbage collection" might affect one node at a time, but it appears to affect all nodes in the cluster at once.
We've been counting on (quasi-) LRU-based eviction to keep the bucket at 90% or so, but this behavior suggests that might not be such a good idea. I can move all the smaller buckets to a different cluster and at least isolate the problem (we've tested that theory), but the problem would still remain on the big bucket.
Are there any settable eviction policies that might alter this behavior? What options might I have other than coming up with a client-side eviction mechanism?
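To make the client-side idea concrete, here is a rough sketch of what I have in mind: instead of relying on LRU, write every item with an expiry chosen so that the live data settles below the bucket's capacity. Everything in it is an assumption for illustration: the function name, the idea that new keys arrive at a roughly steady rate with a known average size, and the target fill level. I also have not verified that expiring items this way actually avoids the stall, since expired items are reclaimed lazily.

```python
import math

# memcached interprets expiry values larger than 30 days (in seconds)
# as absolute Unix timestamps, so relative TTLs are capped here.
THIRTY_DAYS = 60 * 60 * 24 * 30


def ttl_for_target_fill(capacity_bytes, target_fill, new_items_per_sec, avg_item_bytes):
    """Hypothetical helper: pick a TTL (seconds) so that steady-state live
    data is roughly target_fill * capacity_bytes.

    Assumes new keys arrive at a steady rate and every item gets the same
    TTL, so live bytes ~= new_items_per_sec * avg_item_bytes * ttl.
    """
    if not 0 < target_fill <= 1:
        raise ValueError("target_fill must be in (0, 1]")
    bytes_per_sec = new_items_per_sec * avg_item_bytes
    if bytes_per_sec <= 0:
        raise ValueError("write rate and item size must be positive")
    ttl = math.floor(target_fill * capacity_bytes / bytes_per_sec)
    # Keep the result a valid relative expiry: at least 1s, at most 30 days.
    return max(1, min(ttl, THIRTY_DAYS))
```

The returned value would be passed as the expiry on every set against the big bucket. The write rate and item size would have to come from our own measurements; the 15-20K ops/sec figure above includes reads, so it can't be plugged in directly.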
Thanks
Paul