How do you determine why Membase reports a Memcached node is "down"?
This happens constantly: the Membase web console works on a server, but it reports that a memcached node is down. However, I can telnet directly to memcached, and it works fine.
For example, right now if I go to my Membase console, go to Logs, and click "Generate Diagnostic Report", I get this reported for one of the nodes:
[CODE]nodes_info = [{struct,[{hostname,<<"10.20.30.40">>},
{status,<<"unhealthy">>},
{uptime,<<"0">>},
{version,<<"unknown">>},
{os,<<"unknown">>},
{memoryTotal,0},
{memoryFree,0},
{mcdMemoryReserved,642},
{mcdMemoryAllocated,276950},
{ports,{struct,[{proxy,11212},{direct,11211}]}},
{otpNode,<<"ns_1@10.20.30.40">>},
{otpCookie,<<"ucbwwmeuzmvtxwjh">>}]},[/CODE]
So it is reporting the node as down. If I RDP to the same box on which the Membase console is running, I can telnet and 'stats' memcached just fine:
[CODE]C:\> telnet 10.20.30.40 11211
STATS
STAT pid 5056
STAT uptime 596
STAT time 1285089610
STAT version 1.4.4_203_g7bf58ee
STAT pointer_size 32
STAT daemon_connections 10
STAT curr_connections 11
STAT total_connections 11
STAT connection_structures 11
STAT cmd_get 0
STAT cmd_set 0
STAT cmd_flush 0
STAT auth_cmds 0
STAT get_hits 0
STAT get_misses 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT bytes_read 13
STAT bytes_written 7
STAT limit_maxbytes 67108864
STAT rejected_conns 0
STAT threads 4
STAT conn_yields 0
STAT evictions 0
STAT curr_items 0
STAT total_items 0
STAT bytes 0
STAT engine_maxbytes 67108864
STAT bucket_conns 11
END[/CODE]
So, how does Membase determine whether a Memcached node is up or down? Side note: restarting the Membase Windows service does not fix the issue. The web console always works, but the memcached node is always reported as down. Sometimes both work, but the issue usually starts after Windows has had to be rebooted.
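For anyone who wants to repeat the manual telnet check above from a script, here is a minimal Python sketch. It assumes the node speaks the standard memcached text protocol on its direct port (11211 in this thread); the function names `parse_stats` and `fetch_stats` are made up for this illustration. Note that it only shows memcached is answering. It says nothing about how Membase itself decides that a node is healthy.

```python
import socket


def parse_stats(text):
    """Turn 'STAT name value' lines into a dict, stopping at the END line."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host, port=11211, timeout=2.0):
    """Send 'stats' to a memcached node and return the parsed reply.

    Hypothetical helper: host and port are whatever you would telnet to.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stats\r\n")
        data = b""
        # The text protocol terminates the stats reply with END\r\n.
        while not data.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_stats(data.decode("ascii", errors="replace"))
```

Calling `fetch_stats("10.20.30.40")` from the console box would return a dict with keys such as `uptime` and `version`, the same values shown in the telnet output above.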
Perry, on this server, Windows Firewall is disabled. There is no aftermarket/3rd party firewall software either.
I can also telnet to these ports:
11211 (and 'stats' dumps output)
11212 (and 'stats' dumps output)
4369 (can connect. pressing any key disconnects)
21100 (can connect)
I had changed the web console to port 11213 instead of 8080, because 8080 was in use by something else, so 11213 also works and gives the web console html page.
Can you check the port range of 21100-21199?
Thanks
Perry
Perry, it looks like erl.exe is listening on ports:
11212
11213 (web console, instead of 8080)
21100
epmd.exe is listening on port:
4369
memcached.exe is listening on ports:
11211
58615
Nothing is listening on or connected to ports 21101-21199, but there is no firewall that would prevent something from opening a connection on those ports.
Last night we rebuilt the Northscale cluster (service_unregister.bat, then service_register.bat, then rejoin the cluster) on all the nodes. This fixed the issue for now; all the nodes are again showing as "up". The original issue of nodes being reported "down" even though they are running seems to start when the server reboots for whatever reason. Sometimes, after Windows starts back up, restarting the Northscale service also fixes the issue; sometimes it does not.
Thanks for that info. Do these servers have multiple IP addresses? We definitely have some known issues around that...
Perry
Jeff, we just released beta 4 this morning which introduces support for memcached bucket types. I would be very interested to see if this version resolves your issue as we've made a number of improvements to the clustering and monitoring code.
Perry
I'm heavily suspicious of a firewall issue. I will check into what specific port we use to determine "upness", but can you check for any Windows firewall settings?
You should be able to telnet into all of these ports on both servers. You won't necessarily get any output, but telnet should at least connect successfully:
11211 (obviously working)
11212 (should be working)
4369 (possibly your problem)
8080 (obviously working)
21100 - 21199 (possibly your problem)
Perry
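Telnetting to a hundred-odd ports by hand gets tedious, so here is a rough Python sketch of the same check. It only tests whether a TCP connection can be opened, which is all the telnet test does. The names `can_connect`, `check_ports`, and `MEMBASE_PORTS` are made up for this example, and the port list is simply the one given in this post (swap 8080 for whatever port your web console actually uses).

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port to True (connectable) or False (not connectable)."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Ports listed in this thread: memcached direct, proxy, epmd, web console,
# plus the 21100-21199 range.
MEMBASE_PORTS = [11211, 12 + 11200, 4369, 8080] + list(range(21100, 21200))
```

Running `check_ports("10.20.30.40", MEMBASE_PORTS)` from each server against the other and printing the ports that come back `False` shows which ones cannot be reached. A port in the 21100-21199 range with nothing listening on it will also report `False`, so a failure there is not by itself proof of a firewall problem.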