Node stuck in error loop
Thu, 10/04/2012 - 11:46
It appears one of my nodes is in a state that it cannot recover from:
[error_logger:error] [2012-10-04 14:38:56] [ns_1@10.10.10.54:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: mc_daemon:init/1
pid: <0.20072.0>
registered_name: []
exception exit: {function_clause,
[{mc_daemon,handle_info,
[{'DOWN',#Ref<0.0.29.240307>,process,<0.20076.0>,
{case_clause,{ok,<<"\"bar\"">>}}},
batching,
{state,<<"sms_fs">>,true,0,nil,[],
{115,[]},
116,
[{2356155904,20,
{delete,
<<"00632391-121D-47FF-8F6C-2E9E0847A651">>}},
{2339378688,20,
{delete,
<<"0022559A-EE25-4BE1-A256-84662EE1C755">>}}],
"tsrq",4,<0.20073.0>,nil,
[#Ref<0.0.29.240307>,#Ref<0.0.29.238277>],
#Port<0.26102>}]},
{gen_fsm,handle_msg,7},
{proc_lib,init_p_do_apply,3}]}
in function gen_fsm:terminate/7
ancestors: [<0.20071.0>]
messages: [{'$gen_sync_event',{<0.20071.0>,#Ref<0.0.29.240308>},
{163,116,
<<0,0,0,22,0,0,0,0,0,0,0,0>>,
518,
<<"01763FD4-B5A7-4763-949A-E538945896E3">>,
<<123,34,116,111,34,58,34,57,55,48,53,52,
49,48,51,52,48,34,44,34,67,65,83,34,58,
34,67,65,83,34,44,34,99,111,110,118,
101,114,115,97,116,105,111,110,73,100,
34,58,34,56,55,70,66,52,49,67,69,50,48,
52,68,56,53,65,66,66,55,68,70,56,52,65,
69,56,52,67,52,57,51,55,48,34,44,34,
114,101,109,111,116,101,68,111,99,73,
100,34,58,34,34,44,34,109,101,115,115,
97,103,101,67,111,117,110,116,34,58,34,
34,44,34,115,116,97,116,117,115,34,58,
34,80,83,84,78,34,44,34,100,105,114,
101,99,116,105,111,110,34,58,34,111,
117,116,34,44,34,108,97,115,116,85,112,
100,97,116,101,100,34,58,34,49,51,52,
57,50,50,50,57,48,56,34,44,34,102,114,
111,109,34,58,34,57,55,48,54,49,54,48,
50,48,55,34,44,34,109,115,103,34,58,34,
73,32,119,105,108,108,32,110,101,118,
101,114,32,116,101,120,116,32,121,117,
104,32,97,103,97,105,110,46,46,32,98,
121,101,32,34,44,34,116,105,109,101,83,
116,97,109,112,34,58,34,49,51,52,57,50,
50,50,57,48,56,34,44,34,105,115,80,97,
114,101,110,116,34,58,102,97,108,115,
101,44,34,100,111,99,73,100,34,58,34,
48,49,55,54,51,70,68,52,45,66,53,65,55,
45,52,55,54,51,45,57,52,57,65,45,69,53,
51,56,57,52,53,56,57,54,69,51,34,44,34,
99,111,110,118,101,114,115,97,116,105,
111,110,77,101,109,98,101,114,115,34,
58,91,93,44,34,105,115,83,101,110,116,
34,58,102,97,108,115,101,44,34,105,100,
67,117,115,116,111,109,101,114,77,97,
115,116,101,114,34,58,34,49,48,52,66,
52,53,49,55,45,50,69,69,50,45,52,56,48,
66,45,65,51,52,49,45,65,49,55,70,55,55,
55,69,57,50,49,50,34,44,34,105,100,80,
108,97,110,34,58,34,101,49,102,52,98,
101,49,51,45,56,102,97,56,45,52,51,57,
97,45,98,98,56,50,45,54,51,54,55,54,
101,98,57,49,54,98,102,34,44,34,109,
101,115,115,97,103,101,83,111,117,114,
99,101,34,58,34,34,125,1,20,0,0,0,1,0,
0,4,35,108,195,241,187,0,0,1,240,0,0,0,
0>>,
0,0}}]
links: [<0.20071.0>,<0.20073.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 75025
stack_size: 24
reductions: 51220
neighbours:
neighbour: [{pid,<0.20071.0>},
{registered_name,[]},
{initial_call,{mc_connection,init,1}},
{current_function,{gen,do_call,4}},
{ancestors,[]},
{messages,[]},
{links,[<0.518.0>,<0.20072.0>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,16},
{reductions,130753}]
[error_logger:error] [2012-10-04 14:38:56] [ns_1@10.10.10.54:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server <0.20073.0> terminating
** Last message in was {'EXIT',<0.20072.0>,
{function_clause,
[{mc_daemon,handle_info,
[{'DOWN',#Ref<0.0.29.240307>,process,<0.20076.0>,
{case_clause,{ok,<<"\"bar\"">>}}},
batching,
{state,<<"sms_fs">>,true,0,nil,[],
{115,[]},
116,
[{2356155904,20,
{delete,
<<"00632391-121D-47FF-8F6C-2E9E0847A651">>}},
{2339378688,20,
{delete,
<<"0022559A-EE25-4BE1-A256-84662EE1C755">>}}],
"tsrq",4,<0.20073.0>,nil,
[#Ref<0.0.29.240307>,#Ref<0.0.29.238277>],
#Port<0.26102>}]},
{gen_fsm,handle_msg,7},
{proc_lib,init_p_do_apply,3}]}}
** When Server state == {state,
{<0.20073.0>,mc_batch_sup},
simple_one_for_one,
[{child,undefined,mc_batch_sup,
{mc_batch_sup,start_link_worker,[]},
temporary,3600000,worker,[]}],
{set,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],
[<0.20075.0>],
[],[],[],[],[],[]}}},
0,1,[],mc_batch_sup,[]}
** Reason for termination ==
** {function_clause,
[{mc_daemon,handle_info,
[{'DOWN',#Ref<0.0.29.240307>,process,<0.20076.0>,
{case_clause,{ok,<<"\"bar\"">>}}},
batching,
{state,<<"sms_fs">>,true,0,nil,[],
{115,[]},
116,
[{2356155904,20,
{delete,<<"00632391-121D-47FF-8F6C-2E9E0847A651">>}},
{2339378688,20,
{delete,<<"0022559A-EE25-4BE1-A256-84662EE1C755">>}}],
"tsrq",4,<0.20073.0>,nil,
[#Ref<0.0.29.240307>,#Ref<0.0.29.238277>],
#Port<0.26102>}]},
{gen_fsm,handle_msg,7},
{proc_lib,init_p_do_apply,3}]}
[error_logger:error] [2012-10-04 14:38:56] [ns_1@10.10.10.54:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: supervisor:mc_batch_sup/1
pid: <0.20073.0>
registered_name: []
exception exit: {function_clause,
[{mc_daemon,handle_info,
[{'DOWN',#Ref<0.0.29.240307>,process,<0.20076.0>,
{case_clause,{ok,<<"\"bar\"">>}}},
batching,
{state,<<"sms_fs">>,true,0,nil,[],
{115,[]},
116,
[{2356155904,20,
{delete,
<<"00632391-121D-47FF-8F6C-2E9E0847A651">>}},
{2339378688,20,
{delete,
<<"0022559A-EE25-4BE1-A256-84662EE1C755">>}}],
"tsrq",4,<0.20073.0>,nil,
[#Ref<0.0.29.240307>,#Ref<0.0.29.238277>],
#Port<0.26102>}]},
{gen_fsm,handle_msg,7},
{proc_lib,init_p_do_apply,3}]}
in function gen_server:terminate/6
ancestors: [<0.20072.0>,<0.20071.0>]
messages: []
links: [<0.20075.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 75025
stack_size: 24
reductions: 5114
neighbours:
neighbour: [{pid,<0.20075.0>},
{registered_name,[]},
{initial_call,
{mc_batch_sup,sync_update_docs,
['Argument__1','Argument__2','Argument__3',
'Argument__4']}},
{current_function,{couch_db,get_result,2}},
{ancestors,[<0.20073.0>,<0.20072.0>,<0.20071.0>]},
{messages,[]},
{links,[<0.20073.0>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,75025},
{stack_size,17},
{reductions,93076}]
That error repeats continuously. If I fail the node over or remove it, the same error starts happening on
a different node.
The rest of the cluster (3 other nodes) is functioning just fine. However, I'm worried that at some point
the problem will spiral and keep moving from node to node throughout the cluster.
I've tried deleting the keys specified in the error message, but it says they don't exist. I've also tried
creating them by hand on a different cluster and doing a backup and restore -a, with the same result.
Any suggestions? Thanks.