Details
Description
Have not seen this issue previously ever on 181. Seems to be something new.
Test case that failed
swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=3,num-buckets=5,num-swap=2,swap-orchestrator=True,keys-count=800000
1) Failover 2 nodes, including the orchestrator
2) Rebalance them out
3) Add them back and rebalance in, without cleaning them.
[rebalance:error] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27930.16>:ns_rebalancer:do_wait_buckets_shutdown:170] Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2",
"bucket-1",
"bucket-0",
"bucket-4",
"bucket-3"]}}}]
[user:info] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27893.16>:ns_orchestrator:handle_info:246] Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2","bucket-1","bucket-0",
"bucket-4","bucket-3"]}}}]}
Test case that failed
swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=3,num-buckets=5,num-swap=2,swap-orchestrator=True,keys-count=800000
1) Failover 2 nodes, including the orchestrator
2) Rebalance them out
3) Add them back and rebalance in, without cleaning them.
[rebalance:error] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27930.16>:ns_rebalancer:do_wait_buckets_shutdown:170] Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2",
"bucket-1",
"bucket-0",
"bucket-4",
"bucket-3"]}}}]
[user:info] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27893.16>:ns_orchestrator:handle_info:246] Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2","bucket-1","bucket-0",
"bucket-4","bucket-3"]}}}]}
[user:info] [2012-05-10 8:45:32] [ns_1@10.1.3.77:'ns_memcached-bucket-0':ns_memcached:terminate:296] Shutting down bucket "bucket-0" on 'ns_1@10.1.3.77' for deletion
... (no other messages from this erlang process in between) ...
[ns_server:debug] [2012-05-10 8:49:38] [ns_1@10.1.3.77:'ns_memcached-bucket-0':ns_storage_conf:delete_bucket_db_directory:384] Result of deleting db directory: "/opt/couchbase/var/lib/couchbase/data/bucket-0-data": ok
That's what rebalance failure actually mentioned.
I've seen some memcached crash on this .77 node shortly before we've started bucket deletion. So my guess is that warmup prevents force shutdown to be quick. I.e.:
[error_logger:info] [2012-05-10 8:45:04] [ns_1@10.1.3.77:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_port_sup}
started: [{pid,<0.29206.11>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{supervisor_cushion,start_link,
[memcached,5000,ns_port_server,start_link,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000",
"-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
and prior to that:
[error_logger:error] [2012-05-10 8:45:03] [ns_1@10.1.3.77:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.15086.9>
registered_name: []
exception exit: {abnormal,134}
in function gen_server:terminate/6
ancestors: [<0.15085.9>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
<0.60.0>]
messages: [{'EXIT',#Port<0.174933>,normal}]
links: [<0.15085.9>]
dictionary: []
trap_exit: true
status: running
heap_size: 121393
stack_size: 24
reductions: 1661845
neighbours:
and even more prior to that:
[ns_server:info] [2012-05-10 8:44:34] [ns_1@10.1.3.77:<0.15086.9>:ns_port_server:log:161] memcached<0.15086.9>: [warn] Epoll ADD(5) on fd 247 failed. Old events were 0; read change was 1 (add); write change was 1 (add): Bad file descriptor
memcached<0.15086.9>: Failed to add connection to libevent: Bad file descriptor
memcached<0.15086.9>: memcached: bucket_engine.c:997: handle_disconnect: Assertion `es' failed.