Membase EC2 timeout error on 3 node cluster
Hello,
We are having intermittent issues with our 3 node cluster.
At random intervals the connection to one of the nodes is lost and we get this error:
======
Node ('ns_1@10.48.137.138') was automatically failovered.
auto_failover001 ns_1@10.226.226.6 08:39:06 - Sun Feb 19, 2012
Failed over 'ns_1@10.48.137.138': ok ns_orchestrator006 ns_1@10.226.226.6 08:39:06 - Sun Feb 19, 2012
Shutting down bucket "default" on 'ns_1@10.48.137.138' for server shutdown ns_memcached002 ns_1@10.48.137.138 08:35:52 - Sun Feb 19, 2012
Port server memcached on node 'ns_1@10.48.137.138' exited with status 134. Restarting. Messages: exception caught in task Fetching item from disk: 7c7739e276e6e10e66a29617f93c8b92: Unhandled case in sqlite-pst: 17 (database schema has changed)
Object unexpectedly changed size by 4 bytes
memcached: stored-value.cc:367: static void StoredValue::increaseCacheSize(HashTable&, size_t, bool): Assertion `ht.cacheSize.get() < ((size_t)1<<(sizeof(size_t)*8-1))' failed. ns_port_server000 ns_1@10.48.137.138 08:35:31 - Sun Feb 19, 2012
Control connection to memcached on 'ns_1@10.48.137.138' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,stats_recv,4},
  {mc_client_binary,stats,4},
  {ns_memcached,handle_call,3},
  {gen_server,handle_msg,5},
  {proc_lib,init_p_do_apply,3}]}
====
This also happened the day before with the same error message. I left three SSH consoles open to the servers and noticed that two of them were disconnected this morning. So is EC2 having random network issues, or is it something else? We have been having this issue for a while now and it is extremely annoying!
I have also checked the server logs and noticed moxi segfaults, but I'm not sure they are related.
====
[2527244.634510] moxi[9330]: segfault at 40 ip 00000040 sp b5725530 error 14 in moxi[8048000+a7000]
[2778981.638250] moxi[18570]: segfault at 0 ip (null) sp b586e530 error 14 in moxi[8048000+a7000]
[2779059.356534] moxi[23437]: segfault at 0 ip (null) sp b5755530 error 14 in moxi[8048000+a7000]
[2779080.212025] moxi[23447]: segfault at fffff0 ip 0805e3b4 sp b5721530 error 4 in moxi[8048000+a7000]
[2779107.689050] moxi[23455]: segfault at 400 ip 00000400 sp b582d530 error 14 in moxi[8048000+a7000]
root@ip-10-227-55-162:/opt/membase/bin# cat /etc/issue
Debian GNU/Linux 6.0 \n \l
===
Any help would be appreciated...
Chris
It sounds like network connectivity issues.
Questions:
1) Can you verify they're all in the same region, and preferably in the same availability zone?
2) Can you confirm they are not micro instances? Micro instances can be starved of resources for long enough to trigger failover.
Otherwise, we'd have to gather more info. This doesn't sound expected. There are many people running on EC2.
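If it helps, the two checks in the questions above could be scripted roughly like this. It is only a sketch under some assumptions: each node can reach the EC2 instance metadata service over plain HTTP (newer instances may require a session token instead), the region is taken to be the availability zone name minus its trailing letter, and the function names (`fetch_metadata`, `describe_local_node`, `check_cluster`) are made up for illustration and are not part of Membase.

```python
import urllib.request

# EC2 instance metadata service; only reachable from inside an instance.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(path, timeout=2.0):
    """Read one value from the metadata service (run this on the node itself)."""
    with urllib.request.urlopen(METADATA_BASE + path, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def describe_local_node():
    """Collect the two facts asked about: availability zone and instance type."""
    return {
        "zone": fetch_metadata("placement/availability-zone"),
        "type": fetch_metadata("instance-type"),
    }


def check_cluster(nodes):
    """nodes maps a node name to {"zone": ..., "type": ...}; returns a list of warnings."""
    warnings = []
    zones = {info["zone"] for info in nodes.values()}
    # Assumption: region == zone name without its trailing letter (us-east-1a -> us-east-1).
    regions = {zone[:-1] for zone in zones}
    if len(regions) > 1:
        warnings.append("nodes span multiple regions: " + ", ".join(sorted(regions)))
    elif len(zones) > 1:
        warnings.append("nodes span multiple availability zones: " + ", ".join(sorted(zones)))
    for name, info in sorted(nodes.items()):
        if info["type"].endswith(".micro"):
            warnings.append(name + " is a micro instance (" + info["type"] + ")")
    return warnings
```

Run `describe_local_node()` on each of the three nodes, put the results in one dict keyed by node name, and pass that to `check_cluster`. An empty list means all nodes share one availability zone and none of them is a micro instance.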