Problem with spymemcached when server is failed over
I'm seeing a problem where, on a simple cluster of 3 machines with auto-failover enabled, spymemcached doesn't properly switch to a different node when one node is killed manually. The root cause at the bottom of the stack trace is as follows:
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancelled
Our client library wraps the spymemcached MemcachedClient asyncGet method and cancels the operation if it exceeds a certain timeout. I've tried both a 1000 ms and a 5000 ms timeout, and the choice didn't seem to affect the number of cancelled operations. The steps to reproduce it for us are as follows:
Three machines run the membase-server-enterprise_x86_64_22.214.171.124.deb distribution of Membase with auto-failover enabled. The client (I have tried spymemcached 2.7.2 and 2.7.3) is started in vbucket mode with all 3 nodes passed in the URI list (the /pools URI in each case).
The test creates 50 records in Membase (to try to ensure at least one lands on each server), then loops through the records and does a GET on each one, sleeping 1 second between GETs.
All of this works fine until I forcibly kill the server on one node with sudo pkill -u membase. Membase auto-fails over (verified through the UI), and from that point on I see the above stack trace on roughly every 3rd GET. It seems the client has trouble working out which server should now serve the values for which the killed server was the primary (possibly locking somewhere and timing out?). If I telnet to each server on port 11213, I see that the value I'm expecting is actually there on both of the remaining nodes.
If I then bring the killed server up again and rebalance the cluster, the client recovers gracefully.
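For anyone who wants to reproduce the test loop without our client library, here is a rough sketch of the pattern described above: wait on each asynchronous GET with a timeout, cancel it if the timeout expires, and record which keys failed. This is not our actual code. The names FailoverProbe, await and failingKeys are made up for illustration, and the stub in main stands in for a real client. Against spymemcached, the function passed in would presumably be something like key -> client.asyncGet(key).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper, not part of spymemcached or of our client library.
public class FailoverProbe {

    /** Outcome of a single GET attempt. */
    public enum Outcome { OK, TIMED_OUT, FAILED }

    // Waits up to timeoutMs for the future; on timeout, cancels it
    // (the same thing our wrapper does around asyncGet).
    public static Outcome await(Future<?> future, long timeoutMs) {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Outcome.OK;
        } catch (TimeoutException e) {
            future.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException | CancellationException e) {
            // e.g. the operation was cancelled or failed underneath us
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    // Issues one GET per key, sleeping sleepMs between GETs, and returns
    // the keys whose GET did not complete successfully.
    public static List<String> failingKeys(List<String> keys,
                                           Function<String, Future<?>> asyncGet,
                                           long timeoutMs, long sleepMs) {
        List<String> failed = new ArrayList<>();
        for (String key : keys) {
            if (await(asyncGet.apply(key), timeoutMs) != Outcome.OK) {
                failed.add(key);
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            keys.add("key-" + i);
        }
        // Stand-in for a real client: every third key never completes,
        // imitating keys whose primary was the killed node.
        Function<String, Future<?>> stub = key -> {
            int n = Integer.parseInt(key.substring(4));
            if (n % 3 == 0) {
                return new CompletableFuture<Object>();
            }
            return CompletableFuture.completedFuture("value-" + n);
        };
        System.out.println(failingKeys(keys, stub, 50, 0)); // prints [key-0, key-3]
    }
}
```

Logging the failing keys rather than just counting them should show whether the cancelled GETs are always the same keys (which would point at the vbuckets that used to live on the killed node) or are spread across all keys.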