Membase service needs to be restarted frequently
We've successfully upgraded our numerous Windows servers to 1.7.1, thanks to Perry's patch.
Our Membase servers are on the same machines as our web servers.
Right now our web apps don't talk to Membase. We're in a testing/migration phase in which the Membase services are up and running, but we only hit them with a test app every minute or so and, from time to time, with a test page that tries to Set/Get a value.
Basically, that's an extremely small, not to say insignificant, amount of traffic for 9 clustered Membase servers.
However, quite frequently (on the order of 2 to 3 times a week), one of the 9 Membase servers, seemingly at random, gets into a funky state where the service is up but the Membase client cannot Set or Get a value on it.
The solution so far has been to RDP into that server and restart the Membase service. That solves the issue, at least until another server gets into that funky state.
Our test app, which uses the same Membase client and has logging turned on, gives us the following information:
2011-09-06 13:51:17,488  ERROR Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl Could not init pool.
System.TimeoutException: Could not connect to 10.2.5.10:11210
at Enyim.Caching.Memcached.PooledSocket.ConnectWithTimeout(Socket socket, IPEndPoint endpoint, Int32 timeout)
at Enyim.Caching.Memcached.PooledSocket..ctor(IPEndPoint endpoint, TimeSpan connectionTimeout, TimeSpan receiveTimeout)
Our test page, which displays various properties available on the Membase client, shows that the problematic Membase server has its "IsAlive" flag set to FALSE when this issue occurs.
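For anyone who wants to check the same thing outside the Enyim client, here is a minimal sketch of a raw TCP connect probe against the data port (11210, taken from the log above). It only tells you whether the port accepts a connection within a timeout, which is the step that fails in the stack trace. It does not speak the memcached protocol. The function name and the 2-second default timeout are my own assumptions, not anything from the client or from Membase.

```python
import socket


def can_connect(host, port=11210, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; 11210 is the port shown in the log,
    and the default timeout is an arbitrary choice.
    """
    try:
        # create_connection resolves the host and applies the timeout to connect().
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections, and unreachable hosts.
        return False
```

Calling `can_connect("10.2.5.10")` from the web server that logged the error, while the node is in the bad state, would show whether the failure is at the TCP level or somewhere inside the client.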
The one thing I haven't tried yet is connecting to the web console on the affected host while it is in that state to see what it reports. I'll update this post the next time it happens.
We really need a more stable Membase cluster before we can start directing actual traffic to it, but I'm not sure where to look right now to troubleshoot this issue...
Hoping someone can help.