Details
Description
Have not seen this issue previously ever on 181. Seems to be something new.
Test case that failed
swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=3,num-buckets=5,num-swap=2,swap-orchestrator=True,keys-count=800000
1) Failover 2 nodes, including the orchestrator
2) Rebalance them out
3) Add them back and rebalance in, without cleaning them.
[rebalance:error] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27930.16>:ns_rebalancer:do_wait_buckets_shutdown:170] Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2",
"bucket-1",
"bucket-0",
"bucket-4",
"bucket-3"]}}}]
[user:info] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27893.16>:ns_orchestrator:handle_info:246] Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2","bucket-1","bucket-0",
"bucket-4","bucket-3"]}}}]}
Test case that failed
swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=3,num-buckets=5,num-swap=2,swap-orchestrator=True,keys-count=800000
1) Failover 2 nodes, including the orchestrator
2) Rebalance them out
3) Add them back and rebalance in, without cleaning them.
[rebalance:error] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27930.16>:ns_rebalancer:do_wait_buckets_shutdown:170] Failed to wait deletion of some buckets on some nodes: [{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2",
"bucket-1",
"bucket-0",
"bucket-4",
"bucket-3"]}}}]
[user:info] [2012-05-10 8:48:59] [ns_1@10.1.3.74:<0.27893.16>:ns_orchestrator:handle_info:246] Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@10.1.3.77',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["bucket-2","bucket-1","bucket-0",
"bucket-4","bucket-3"]}}}]}
Activity
Karan Kumar
made changes -
| Field | Original Value | New Value |
|---|---|---|
| Fix Version/s | 1.8.1 [ 10295 ] | |
| Affects Version/s | 1.8.1-release-candidate [ 10299 ] | |
| Component/s | ns_server [ 10019 ] |
Aleksey Kondratenko
made changes -
| Assignee | Aleksey Kondratenko [ alkondratenko ] | Frank Weigel [ frank ] |
Dipti Borkar
made changes -
| Labels | customer | |
| Assignee | Frank Weigel [ frank ] | Aleksey Kondratenko [ alkondratenko ] |
Aleksey Kondratenko
made changes -
| Assignee | Aleksey Kondratenko [ alkondratenko ] | Dipti Borkar [ dipti ] |
Farshid Ghods
made changes -
| Summary | Rebalance failed due to buckets_shutdown_wait_failed | user is unable to add the node back to the cluster because the failed over node is stuck in warmup phase |
Dipti Borkar
made changes -
| Assignee | Dipti Borkar [ dipti ] | Chiyoung Seo [ chiyoung ] |
Dipti Borkar
made changes -
| Sprint Status | Current Sprint |
Dipti Borkar
made changes -
| Sprint Priority | 0 |
Aleksey Kondratenko
made changes -
| Assignee | Chiyoung Seo [ chiyoung ] | Aleksey Kondratenko [ alkondratenko ] |
Aleksey Kondratenko
made changes -
| Status | Open [ 1 ] | Resolved [ 5 ] |
| Resolution | Fixed [ 1 ] |
Dipti Borkar
made changes -
| Sprint Status | Current Sprint | |
| Sprint Priority | 0 |
Farshid Ghods
made changes -
| Status | Resolved [ 5 ] | Closed [ 6 ] |
[user:info] [2012-05-10 8:45:32] [ns_1@10.1.3.77:'ns_memcached-bucket-0':ns_memcached:terminate:296] Shutting down bucket "bucket-0" on 'ns_1@10.1.3.77' for deletion
... (no other messages from this erlang process in between) ...
[ns_server:debug] [2012-05-10 8:49:38] [ns_1@10.1.3.77:'ns_memcached-bucket-0':ns_storage_conf:delete_bucket_db_directory:384] Result of deleting db directory: "/opt/couchbase/var/lib/couchbase/data/bucket-0-data": ok
That's what rebalance failure actually mentioned.
I've seen some memcached crash on this .77 node shortly before we've started bucket deletion. So my guess is that warmup prevents force shutdown to be quick. I.e.:
[error_logger:info] [2012-05-10 8:45:04] [ns_1@10.1.3.77:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_port_sup}
started: [{pid,<0.29206.11>},
{name,
{memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000","-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]}},
{mfargs,
{supervisor_cushion,start_link,
[memcached,5000,ns_port_server,start_link,
[memcached,"/opt/couchbase/bin/memcached",
["-X",
"/opt/couchbase/lib/memcached/stdin_term_handler.so",
"-l","0.0.0.0:11210,0.0.0.0:11209:1000",
"-E",
"/opt/couchbase/lib/memcached/bucket_engine.so",
"-B","binary","-r","-c","10000","-e",
"admin=_admin;default_bucket_name=default;auto_create=false",
[]],
[{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE",
"/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
use_stdio,stderr_to_stdout,exit_status,
port_server_send_eol,stream]]]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
and prior to that:
[error_logger:error] [2012-05-10 8:45:03] [ns_1@10.1.3.77:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.15086.9>
registered_name: []
exception exit: {abnormal,134}
in function gen_server:terminate/6
ancestors: [<0.15085.9>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
<0.60.0>]
messages: [{'EXIT',#Port<0.174933>,normal}]
links: [<0.15085.9>]
dictionary: []
trap_exit: true
status: running
heap_size: 121393
stack_size: 24
reductions: 1661845
neighbours:
and even more prior to that:
[ns_server:info] [2012-05-10 8:44:34] [ns_1@10.1.3.77:<0.15086.9>:ns_port_server:log:161] memcached<0.15086.9>: [warn] Epoll ADD(5) on fd 247 failed. Old events were 0; read change was 1 (add); write change was 1 (add): Bad file descriptor
memcached<0.15086.9>: Failed to add connection to libevent: Bad file descriptor
memcached<0.15086.9>: memcached: bucket_engine.c:997: handle_disconnect: Assertion `es' failed.