Couchbase node crash
Hi all,
Today our cluster lost a node. Why is not clear; see parts of the logs below. The node was failed over and the cluster kept running (as expected... ;)
Rejoining the cluster failed with some memcached errors on the faulty node.
What did help was removing the node from the cluster, purging the Couchbase installation, setting the node up again, and adding the new blank node to the cluster.
Our question now is:
What went wrong?
Anybody got some clues?
Thanks,
ARabus
Some excerpts from the log:
First we got some heartbeat errors in the log:
[ns_1@192.168.70.204:system_stats_collector:system_stats_collector:handle_info:130] lost 1 ticks
The node did come back online, but it failed to join the cluster.
Something with memcached errors:
[error_logger:error] [2012-06-26 0:14:36] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_msg:76] ** State machine mb_master terminating
** Last message in was send_heartbeat
** When State == master
** Data == {state,<0.15647.1766>,'ns_1@192.168.70.204',
['ns_1@192.168.70.204','ns_1@192.168.70.227',
'ns_1@192.168.70.228'],
{1340,662463,384085}}
** Reason for termination =
** {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}
[ns_server:error] [2012-06-26 0:14:38] [ns_1@192.168.70.204:'ns_memcached-assets':ns_memcached:handle_call:139] call {stats,
<<>>} took too long: 10535791 us
[ns_doctor:error] [2012-06-26 0:14:38] [ns_1@192.168.70.204:<0.535.0>:ns_doctor:get_nodes:153] Error attempting to get nodes: {exit,
{noproc,
{gen_server,
call,
[ns_doctor,
get_nodes]}}}
[menelaus:warn] [2012-06-26 0:14:47] [ns_1@192.168.70.204:<0.534.0>:menelaus_web:loop:357] Server error during processing: ["web request failed",
{path,
"/pools/default/bucketsStreaming/itunes"},
{type,
exit},
{what,
{timeout,
{gen_server,
call,
[ns_cookie_manager,
cookie_get]}}},
{trace,
[{gen_server,
call,
2},
{menelaus_web,
build_nodes_info_fun,
3},
{menelaus_web_buckets,
build_bucket_node_infos,
5},
{menelaus_web_buckets,
build_bucket_info,
5},
{menelaus_web,
streaming_inner,
3},
{menelaus_web,
handle_streaming,
4},
{menelaus_web_buckets,
checking_bucket_access,
4},
{menelaus_web,
loop,
3}]}]
and then a crash report:
[error_logger:error] [2012-06-26 0:14:51] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: mb_master:init/1
pid: <0.15029.1766>
registered_name: mb_master
exception exit: {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}
in function gen_fsm:terminate/7
ancestors: [ns_server_sup,ns_server_cluster_sup,<0.41.0>]
messages: [{'$gen_event',
{heartbeat,
[...]
and a supervisor error:
[error_logger:error] [2012-06-26 0:14:53] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,ns_server_sup}
Context: child_terminated
Reason: {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}
Offender: [{pid,<0.15029.1766>},
{name,mb_master},
{mfargs,{mb_master,start_link,[]}},
{restart_type,permanent},
{shutdown,infinity},
{child_type,supervisor}]
[ns_server:info] [2012-06-26 0:14:54] [ns_1@192.168.70.204:mb_master:mb_master:init:98] Starting as candidate. Peers: ['ns_1@192.168.70.204',
'ns_1@192.168.70.227',
'ns_1@192.168.70.228']
[ns_server:info] [2012-06-26 0:14:54] [ns_1@192.168.70.204:ns_config_rep:ns_config_rep:init:56] init pulling
[ns_server:info] [2012-06-26 0:14:54] [ns_1@192.168.70.204:mb_master:mb_master:candidate:244] Changing master from undefined to 'ns_1@192.168.70.227'
[ns_doctor:error] [2012-06-26 0:14:52] [ns_1@192.168.70.204:<0.535.0>:ns_doctor:get_nodes:153] Error attempting to get nodes: {exit,
{noproc,
{gen_server,
call,
[ns_doctor,
get_nodes]}}}
[ns_server:info] [2012-06-26 0:14:54] [ns_1@192.168.70.204:ns_node_disco_events:ns_node_disco_log:handle_event:46] ns_node_disco_log: nodes changed: ['ns_1@192.168.70.204',
'ns_1@192.168.70.227',
'ns_1@192.168.70.228']
[ns_server:info] [2012-06-26 0:14:54] [ns_1@192.168.70.204:ns_config_rep:ns_config_rep:do_pull:257] Pulling config from: 'ns_1@192.168.70.227'
[stats:error] [2012-06-26 0:14:54] [ns_1@192.168.70.204:<0.10196.0>:stats_collector:handle_info:95] Exception in stats collector: {exit,
{timeout,
{gen_server,
call,
[{'couch_stats_reader-assets',
'ns_1@192.168.70.204'},
fetch_stats]}},
[{gen_server,
call,
2},
{couch_stats_reader,
fetch_stats,
1},
{stats_collector,
grab_all_stats,
1},
{stats_collector,
handle_info,
2},
{gen_server,
handle_msg,
5},
{proc_lib,
init_p_do_apply,
3}]}
and a bit later another exception occurred:
[stats:error] [2012-06-26 0:15:05] [ns_1@192.168.70.204:<0.10189.0>:stats_collector:handle_info:95] Exception in stats collector: {exit,
{timeout,
{gen_server,
call,
[{'couch_stats_reader-itunes',
'ns_1@192.168.70.204'},
fetch_stats]}},
[{gen_server,
call,
2},
{couch_stats_reader,
fetch_stats,
1},
{stats_collector,
grab_all_stats,
1},
{stats_collector,
handle_info,
2},
{gen_server,
handle_msg,
5},
{proc_lib,
init_p_do_apply,
3}]}
and after a re-join we see this:
[couchdb:info] [2012-06-26 9:18:32] [ns_1@192.168.70.204:<0.5857.2328>:couch_log:info:39] mccouch is listening on port 11213
[error_logger:error] [2012-06-26 9:18:32] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.5855.2328> on node 'ns_1@192.168.70.204' with exit value: {{badmatch,{error,closed}},[{mc_connection,respond,5},{mc_tap,'-process_tap_stream/5-fun-0-',8},{couch_btree,stream_kv_node2,8},{couch_btree,stream_kp_node,7},{couch_btree,fold,4},{couch_db,changes_since,5},{couch_db...
[error_logger:error] [2012-06-26 9:18:32] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server <0.1450.2328> terminating
** Last message in was {'EXIT',<0.1449.2328>,
{{badmatch,{error,closed}},
[{mc_connection,respond,5},
{mc_tap,'-process_tap_stream/5-fun-0-',8},
{couch_btree,stream_kv_node2,8},
{couch_btree,stream_kp_node,7},
{couch_btree,fold,4},
{couch_db,changes_since,5},
{couch_db,fast_reads,2},
{mc_tap,process_tap_stream,5}]}}
** When Server state == {state,
{<0.1450.2328>,mc_batch_sup},
simple_one_for_one,
[{child,undefined,mc_batch_sup,
{mc_batch_sup,start_link_worker,[]},
temporary,3600000,worker,[]}],
undefined,0,1,[],mc_batch_sup,[]}
** Reason for termination ==
** {{badmatch,{error,closed}},
[{mc_connection,respond,5},
{mc_tap,'-process_tap_stream/5-fun-0-',8},
{couch_btree,stream_kv_node2,8},
{couch_btree,stream_kp_node,7},
{couch_btree,fold,4},
{couch_db,changes_since,5},
{couch_db,fast_reads,2},
{mc_tap,process_tap_stream,5}]}
[error_logger:error] [2012-06-26 9:18:32] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: supervisor:mc_batch_sup/1
pid: <0.1450.2328>
registered_name: []
exception exit: {{badmatch,{error,closed}},
[{mc_connection,respond,5},
{mc_tap,'-process_tap_stream/5-fun-0-',8},
{couch_btree,stream_kv_node2,8},
{couch_btree,stream_kp_node,7},
{couch_btree,fold,4},
{couch_db,changes_since,5},
{couch_db,fast_reads,2},
{mc_tap,process_tap_stream,5}]}
in function gen_server:terminate/6
ancestors: [<0.1449.2328>,<0.1448.2328>]
messages: []
links: []
dictionary: []
trap_exit: true
status: running
heap_size: 377
stack_size: 24
reductions: 150
neighbours:
[error_logger:error] [2012-06-26 9:18:32] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_report:72]
=========================SUPERVISOR REPORT=========================
Supervisor: {local,mc_sup}
Context: child_terminated
Reason: {{badmatch,{error,closed}},
[{mc_connection,respond,5},
{mc_tap,'-process_tap_stream/5-fun-0-',8},
{couch_btree,stream_kv_node2,8},
{couch_btree,stream_kp_node,7},
{couch_btree,fold,4},
{couch_db,changes_since,5},
{couch_db,fast_reads,2},
{mc_tap,process_tap_stream,5}]}
Offender: [{pid,<0.517.2328>},
{name,mc_tcp_listener},
{mfargs,{mc_tcp_listener,start_link,[11213]}},
{restart_type,permanent},
{shutdown,2000},
{child_type,worker}]
[error_logger:info] [2012-06-26 9:18:32] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,mc_sup}
started: [{pid,<0.5857.2328>},
{name,mc_tcp_listener},
{mfargs,{mc_tcp_listener,start_link,[11213]}},
{restart_type,permanent},
{shutdown,2000},
{child_type,worker}]
[error_logger:error] [2012-06-26 9:18:32] [ns_1@192.168.70.204:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server <0.534.2328> terminating
** Last message in was {'EXIT',<0.533.2328>,
{{badmatch,{error,closed}},
[{mc_connection,respond,5},
{mc_tap,'-process_tap_stream/5-fun-0-',8},
{couch_btree,stream_kv_node2,8},
{couch_btree,stream_kp_node,7},
{couch_btree,fold,4},
{couch_db,changes_since,5},
{couch_db,fast_reads,2},
{mc_tap,process_tap_stream,5}]}}
** When Server state == {state,
{<0.534.2328>,mc_batch_sup},
simple_one_for_one,
[{child,undefined,mc_batch_sup,
{mc_batch_sup,start_link_worker,[]},
temporary,3600000,worker,[]}],
undefined,0,1,[],mc_batch_sup,[]}
** Reason for termination ==
** {{badmatch,{error,closed}},
[{mc_connection,respond,5},
{mc_tap,'-process_tap_stream/5-fun-0-',8},
{couch_btree,stream_kv_node2,8},
{couch_btree,stream_kp_node,7},
{couch_btree,fold,4},
{couch_db,changes_since,5},
{couch_db,fast_reads,2},
{mc_tap,process_tap_stream,5}]}
[...]
[ns_server:info] [2012-06-26 9:18:32] [ns_1@192.168.70.204:<0.10041.0>:ns_port_server:log:161] memcached<0.10041.0>: Rubbish received on the backend stream. closing it
After that there were no more errors, but all requests to that faulty node returned empty results.
Looking this over, I have no immediate answer as to why re-adding the node was a problem. I'll see if I can get a colleague to look it over.
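In the meantime, an overview of which subsystems logged errors, and when, might make the sequence of events easier to follow. Below is a minimal Python sketch for that. It assumes every log entry starts with a header of the form [logger:level] [YYYY-MM-DD H:MM:SS], as in the excerpts above. The function name summarize is made up for illustration and is not part of Couchbase.

```python
import re
from collections import Counter

# Header of a log entry as it appears in the excerpts above, e.g.
# [ns_server:error] [2012-06-26 0:14:38] [ns_1@192.168.70.204:...]
# Assumption: every entry starts with this header; continuation lines do not.
HEADER = re.compile(
    r"^\[(\w+):(\w+)\] \[(\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2})\]"
)


def summarize(lines):
    """Count entries per (logger, level) and collect timestamps of error entries."""
    counts = Counter()
    error_times = []
    for line in lines:
        match = HEADER.match(line)
        if match is None:
            # Continuation of a multi-line Erlang term; skip it.
            continue
        logger, level, timestamp = match.groups()
        counts[(logger, level)] += 1
        if level == "error":
            error_times.append(timestamp)
    return counts, error_times
```

To use it, open the node's log file, pass the file object to summarize, and print the counts. Many error entries from different loggers within the same few seconds would suggest the node as a whole was stalled, rather than one component failing.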