Memcached process stuck and timing out while moxi stays up
Hi, we use membase built from the Git repo on Amazon EC2 with EBS volumes. Occasionally, when adding or removing a node and doing a rebalance, we start to see timeout errors on some servers and the rebalance fails. After some troubleshooting we figured out that the memcached process is still running but times out every request on its binary port. The process appears to be alive and is doing a small amount of I/O. Moxi still responds to the STAT command but times out on get/set. The only workaround we have found is to kill the memcached process; it gets restarted automatically, and the rebalance then works.
I/O on EBS is really bad. Could it be persistence that is blocking the memcached process? Why do the web interface and moxi still report the node as up if all requests to memcached time out?
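In case it helps to reproduce the check, here is a minimal sketch of the kind of probe that tells a hung memcached apart from a dead one. It sends a binary-protocol no-op and waits for a response header with a socket timeout. The function names are hypothetical, and the port 11210 in the usage note is an assumption about the default memcached port on a membase node, so adjust it to your setup.

```python
import socket
import struct

# Memcached binary protocol header: 24 bytes, network byte order.
# magic, opcode, key length, extras length, data type,
# vbucket/status, total body length, opaque, CAS
HEADER = struct.Struct(">BBHBBHIIQ")
REQ_MAGIC = 0x80
RES_MAGIC = 0x81
OP_NOOP = 0x0A


def build_noop(opaque=0):
    """Build a no-op request: header only, no key, extras or value."""
    return HEADER.pack(REQ_MAGIC, OP_NOOP, 0, 0, 0, 0, 0, opaque, 0)


def probe(host, port, timeout=2.0):
    """Hypothetical helper: returns 'ok', 'timeout', 'closed',
    'unexpected' or 'error' for one no-op round trip."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(build_noop(opaque=1))
            data = b""
            # Read exactly one response header.
            while len(data) < HEADER.size:
                chunk = s.recv(HEADER.size - len(data))
                if not chunk:
                    return "closed"
                data += chunk
    except socket.timeout:
        # Connected (or tried to) but got no answer in time:
        # this is the "process up but stuck" symptom.
        return "timeout"
    except OSError:
        return "error"
    magic = HEADER.unpack(data)[0]
    # Any well-formed response header counts as alive, even if the
    # status field reports an error such as missing authentication.
    return "ok" if magic == RES_MAGIC else "unexpected"
```

For example, `probe("10.0.0.12", 11210)` (address made up) would return `"timeout"` in the state described above, where the process accepts the connection but never answers, which is something a plain "is the process running" check cannot see.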