Membase service monitoring
Hi,
We are using Membase Server 1.7.1.1 and Enyim client version 2.11.
Using the client property:
Enyim.Caching.Memcached.IMemcachedNode.IsAlive
We've created a monitoring web page that we ping every minute to see whether any Membase service is either shut down or in a funky state.
We made that page because simply monitoring whether the Windows service was up didn't catch some of the issues we've seen, where a Membase server goes into a weird state and stops answering requests even though it is still technically running.
We have a thread on this:
http://www.couchbase.org/forums/thread/membase-service-needs-be-restarte...
This hasn't happened in a while, which is why the thread hasn't had any recent activity.
At any rate, our issue is that hitting this page on each server doesn't return consistent results.
Consider a Membase cluster with 8 servers, each of which also serves our web app. If, for example, we turn off Membase on machine 'F' and then hit our monitoring web page on all servers, only one box shows the IsAlive property as false for machine 'F', and it is generally not machine 'F' itself.
All the other machines show everything as OK (IsAlive = true).
I believe Membase is built using Erlang, and at first I thought maybe it takes a minute for the Mnesia database to propagate to all servers. But after 10 minutes, still only that one server shows that machine 'F' is not serving Membase requests.
That machine is nothing special. A couple of days ago we turned off the Membase service on the same server, and a different box picked up that Membase was down on box 'F', but none of the others did.
Is there some caching issue with the cluster state? Is there something like a "NameNode" that is the only box managing server state in the cluster? Anything of that nature that would explain the behavior we observed?
Thanks.
I think you need an updated Enyim client. There were some recent fixes that may explain this. When the connection was momentarily dropped, Enyim automatically marked the server as "dead" and kept it that way until restart. That's now been fixed and the client reconnects, though there's possibly a better reconnect routine coming.
The current stable client can be downloaded here: http://www.couchbase.org/code/couchbase/net
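In the meantime, one way to cross-check what each box reports is to probe every node directly instead of relying on a client's view of it. The sketch below is only an illustration, not part of Enyim or Membase: it is written in Python rather than .NET, the function name is made up, and it assumes the node accepts the memcached text protocol on port 11211 (adjust the port for your setup). It sends the standard "version" command and treats any connection error, timeout, or unexpected reply as "not alive".

```python
import socket


def is_memcached_alive(host, port=11211, timeout=2.0):
    """Hypothetical helper: True if the node answers the memcached
    text-protocol 'version' command within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"version\r\n")
            # A healthy node replies with a line like "VERSION <x.y.z>\r\n".
            reply = sock.recv(128)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return reply.startswith(b"VERSION")
```

Calling this for each server from the monitoring page, for example is_memcached_alive("machine-f"), where "machine-f" is a placeholder host name, should give the same answer no matter which box runs the check, because no client-side node state is involved.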