Unexplained NorthScale Server Failure
After more than 100 days of uptime, our NorthScale server suddenly stopped producing any memcached hits. I was able to access the NorthScale Console, which showed the system as "healthy", although the Analytics section showed "NaN" for all values. Rebooting the machine resolved the issue, and memcached started producing hits again.
I know the details are light, but is this a known issue? Is there anything else I can do to diagnose it?
diag output:
per_node_diag = [{'ns_1@<>',
[{version,
[{os_mon,"2.2.4"},
{kernel,"2.13.4"},
{emoxi,"1.0.3"},
{sasl,"2.1.8"},
{ns_server,"1.0.3"},
{dist_manager,"1.0.3"},
{menelaus,"1.0.3"},
{stdlib,"1.16.4"}]},
{config,
{config,
{full,"/etc/opt/NorthScale/1.0.3/config",undefined,
ns_config_default},
[[{port_servers,
[{'_ver',{0,0,0}},
{memcached,"/opt/NorthScale/1.0.3/lib/memcached",
["-E","/opt/NorthScale/1.0.3/lib/bucket_engine.so",
"-e",
"admin=_admin;engine=/opt/NorthScale/1.0.3/lib/default_engine.so;default_bucket_name=default;auto_create=true",
"-p","11211","-B","auto","-m","15000"],
[{env,
[{"MEMCACHED_CHECK_STDIN","thread"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/etc/opt/NorthScale/1.0.3/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
{cd,"."}]}]},
{directory,"/etc/opt/NorthScale/1.0.3"},
{isasl,
[{'_ver',{0,0,0}},
{path,"/etc/opt/NorthScale/1.0.3/isasl.pw"}]}],
[{directory,
"/opt/NorthScale/1.0.3/bin/ns_server/config"},
{rest,[{'_ver',{0,0,0}},{port,8080}]},
{rest_creds,[{'_ver',{0,0,0}},{creds,[]}]},
{isasl,[{'_ver',{0,0,0}},{path,"./priv/isasl.pw"}]},
{bucket_admin,
[{'_ver',{0,0,0}},{user,"_admin"},{pass,"_admin"}]},
{port_servers,
[{'_ver',{0,0,0}},
{memcached,"./priv/memcached",
["-p","11211","-E","./engines/bucket_engine.so",
"-e",
"admin=_admin;engine=./priv/engines/default_engine.so;default_bucket_name=default;auto_create=false",
"-B","auto"],
[{env,
[{"MEMCACHED_CHECK_STDIN","thread"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE","./priv/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]}]}]},
{alerts,
[{'_ver',{0,0,0}},
{email,[]},
{email_alerts,false},
{email_server,
[{user,undefined},
{pass,undefined},
{addr,undefined},
{port,undefined},
{encrypt,false}]},
{alerts,
[server_down,server_unresponsive,server_up,
server_joined,server_left,bucket_created,
bucket_deleted,bucket_auth_failed]}]},
{pools,
[{'_ver',{0,0,0}},
{"default",
[{port,11212},
{buckets,
[{"default",
[{auth_plain,undefined},
{size_per_node,64}]}]}]}]},
{nodes_wanted,['ns_1@<>']}]],
[[{otp,
[{'_ver',{1285,252651,6348}},
{cookie,occajlbqpnurbkcj}]},
{rest,[{'_ver',{1285,255903,898436}},{port,8080}]},
{rest_creds,
[{'_ver',{1285,255903,898636}},
{creds,[{"<>",[{password,'filtered-out'}]}]}]},
{port_servers,
[{'_ver',{0,0,0}},
{memcached,"/opt/NorthScale/1.0.3/lib/memcached",
["-E","/opt/NorthScale/1.0.3/lib/bucket_engine.so",
"-e",
"admin=_admin;engine=/opt/NorthScale/1.0.3/lib/default_engine.so;default_bucket_name=default;auto_create=true",
"-p","11211","-B","auto"],
[{env,
[{"MEMCACHED_CHECK_STDIN","thread"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/etc/opt/NorthScale/1.0.3/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
{cd,"."}]}]},
{alerts,
[{'_ver',{0,0,0}},
{email,[]},
{email_alerts,false},
{email_server,
[{user,undefined},
{pass,undefined},
{addr,undefined},
{port,undefined},
{encrypt,false}]},
{alerts,
[server_down,server_unresponsive,server_up,
server_joined,server_left,bucket_created,
bucket_deleted,bucket_auth_failed]}]},
{pools,
[{'_ver',{1285,274007,38510}},
{"default",
[{port,11212},
{buckets,
[{"default",
[{auth_plain,{"default","default"}},
{size_per_node,12000}]}]}]}]},
{nodes_wanted,['ns_1@<>']}]],
ns_config_default}},
{basic_info,
[{version,
[{os_mon,"2.2.4"},
{kernel,"2.13.4"},
{emoxi,"1.0.3"},
{sasl,"2.1.8"},
{ns_server,"1.0.3"},
{dist_manager,"1.0.3"},
{menelaus,"1.0.3"},
{stdlib,"1.16.4"}]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,8837917},
{memory_data,
{18357985280,9145479168,{<0.68.0>,972344}}}]},
{memory,{18357985280,9145479168,{<0.68.0>,972344}}},
{disk,[{"/",10321208,28},{"/mnt",423135208,1}]}]}]
nodes_info = [{struct,[{hostname,<<"<>">>},
{status,<<"healthy">>},
{uptime,<<"8837916">>},
{version,<<"1.0.3">>},
{os,<<"x86_64-unknown-linux-gnu">>},
{memoryTotal,18357985280},
{memoryFree,9212506112},
{mcdMemoryReserved,12000},
{mcdMemoryAllocated,0},
{ports,{struct,[{proxy,11212},{direct,11211}]}},
{otpNode,<<"ns_1@<>">>},
{otpCookie,<<"occajlbqpnurbkcj">>}]}]
buckets = [{"default",
[{auth_plain,{"default","default"}},{size_per_node,12000}]}]
Unfortunately, we weren't able to gather any other diagnostic information before rebooting: this is a critical system, and we had to restore it quickly.
As for the diagnostic information I posted, that was the full report from the NorthScale Console (with some IP addresses redacted) before the reboot. Was there any other information you were expecting to see?
Actually I noticed a bit of the diagnostic output was cut off. Here it is:
logs:
-------------------------------
2011-01-03 17:09:36.660 ns_log:4:info:message - Log server started on node 'ns_1@<>'
Hi,
The reboot does explain the short log. In the releases since NorthScale Server 1.0.3, up to Membase 1.6.4.1 (a superset of memcached functionality), we've greatly improved diagnostic capture and logging. You might want to take a look with an eye toward upgrading.
Cheers,
Steve
Hi,
There's (unfortunately) nothing suspicious in that diagnostic output, and it looks pretty abbreviated (likely due to the reboot).
Did you happen to notice anything else interesting before rebooting, such as high CPU usage, network problems, or other resource contention?
Best,
Steve
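If this happens again, one extra data point that could be captured before rebooting is memcached's own counters. Below is a minimal Python sketch, assuming the node still answers the standard memcached text-protocol `stats` command on the direct port 11211 shown in the diag output. The function names (`parse_stats`, `hit_ratio`, `fetch_stats`) and the default host are hypothetical, chosen only for illustration, and this has not been tested against NorthScale Server 1.0.3.

```python
import socket


def parse_stats(raw: bytes) -> dict:
    """Parse a memcached text-protocol 'stats' reply into {name: value} strings."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").split("\r\n"):
        if line == "END":  # terminator of the stats reply
            break
        parts = line.split(" ", 2)  # expected shape: STAT <name> <value>
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_ratio(stats: dict):
    """Return get_hits / (get_hits + get_misses), or None if there were no gets."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else None


def fetch_stats(host: str = "127.0.0.1", port: int = 11211, timeout: float = 5.0) -> dict:
    """Send 'stats' to a memcached-compatible port and return the parsed reply.

    Assumption: the host and port defaults are placeholders; adjust for your node.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection early
                break
            buf += chunk
    return parse_stats(buf)
```

Calling `fetch_stats()` on the affected machine and saving the result, along with `hit_ratio(...)` of it, would show whether memcached itself was counting misses or not responding at all, which is the kind of detail the console's "NaN" values could not provide.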