Membase 1.7.0 keeps crashing/hanging upon (re)start
Starting Membase 1.7.0 right after the upgrade from 1.6.5 worked flawlessly, but ever since then restarting the service has caused major problems: most of the time it won't come back up, and either crashes right away or hangs indefinitely.
If I keep restarting the service, it eventually comes back up, but it can take ten to twenty attempts before it starts properly. During most restarts the erlsrv.exe process appears to crash within seconds. In some cases it instead gets stuck at 25% CPU (one fully busy core on this quad-core system).
Below is the relevant output of ":8091/diag" in chronological order, from my first attempt to restart the service until it finally came back up (some duplicate entries omitted). It's a testing set-up, so I gave up at the end of the day and started again the following morning. Interestingly, the number of "stats_archiver" entries in each crash report seems to decrease until the service finally starts properly again. I also have an "erl_crash.dump" from one of the related erlsrv.exe crashes, in case that helps with further diagnosis.
I'm running the Windows x64 version of Membase on Windows Server 2008 (not R2).
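The trend described above (fewer "stats_archiver" tables in each timeout) can be checked mechanically from a saved copy of the diag output. The following is only an illustrative Python sketch: it assumes the crash reports were saved to a plain text file containing nothing but the reports, and that each one starts with a "CRASH REPORT <pid> date time" header as in the excerpts here. The function name count_timed_out_tables is hypothetical and not part of Membase.

```python
import re

# Header of each crash report, e.g. "CRASH REPORT <0.68.0> 2011-06-29 18:23:41".
# No line anchor, so headers glued onto the previous line still match.
HEADER = re.compile(r"CRASH REPORT <[\d.]+> (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
# One quoted Mnesia table name, e.g. 'stats_archiver-default-week'.
TABLE = re.compile(r"'(stats_archiver-[^']+)'")


def count_timed_out_tables(diag_text):
    """Return (timestamp, number of stats_archiver tables) per crash report.

    Hypothetical helper: assumes diag_text contains only crash reports,
    so every stats_archiver name in a report belongs to its timeout list.
    """
    # With one capture group, re.split yields [preamble, ts1, body1, ts2, body2, ...].
    parts = HEADER.split(diag_text)
    return [(ts, len(TABLE.findall(body)))
            for ts, body in zip(parts[1::2], parts[2::2])]
```

Usage, with a hypothetical file name: `with open("diag.txt") as f: print(count_timed_out_tables(f.read()))`. For the four reports quoted in this post, the counts are 20, 20, 7 and 3.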
CRASH REPORT <0.68.0> 2011-06-29 18:23:41
===============================================================================
Crashing process
initial_call {mb_mnesia,init,['Argument__1']}
pid <0.68.0>
registered_name []
error_info
{exit,{{badmatch,{timeout,['stats_archiver-Statistics-week',
'stats_archiver-default-week',
'stats_archiver-Session-week',
'stats_archiver-MediaWiki-week',
'stats_archiver-Session-day',
'stats_archiver-default-day',
'stats_archiver-Statistics-day',
'stats_archiver-MediaWiki-month',
'stats_archiver-default-month',
'stats_archiver-Statistics-month',
'stats_archiver-Session-month',
'stats_archiver-Statistics-minute',
'stats_archiver-Session-minute',
'stats_archiver-default-minute',
'stats_archiver-default-year',
'stats_archiver-Session-hour',
'stats_archiver-Statistics-year',
'stats_archiver-MediaWiki-year',
'stats_archiver-default-hour',
'stats_archiver-Session-year']}},
[{mb_mnesia,ensure_schema,0},
{mb_mnesia,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}
ancestors [mb_mnesia_sup,ns_server_cluster_sup,<0.59.0>]
messages []
links [<0.82.0>,<0.66.0>]
dictionary []
trap_exit true
status running
heap_size 610
stack_size 24
reductions 7994

CRASH REPORT <0.68.0> 2011-06-29 18:25:54
===============================================================================
Crashing process
initial_call {mb_mnesia,init,['Argument__1']}
pid <0.68.0>
registered_name []
error_info
{exit,{{badmatch,{timeout,['stats_archiver-Statistics-week',
'stats_archiver-default-week',
'stats_archiver-Session-week',
'stats_archiver-MediaWiki-week',
'stats_archiver-Session-day',
'stats_archiver-default-day',
'stats_archiver-Statistics-day',
'stats_archiver-MediaWiki-month',
'stats_archiver-default-month',
'stats_archiver-Statistics-month',
'stats_archiver-Session-month',
'stats_archiver-Statistics-minute',
'stats_archiver-Session-minute',
'stats_archiver-default-minute',
'stats_archiver-default-year',
'stats_archiver-Session-hour',
'stats_archiver-Statistics-year',
'stats_archiver-MediaWiki-year',
'stats_archiver-default-hour',
'stats_archiver-Session-year']}},
[{mb_mnesia,ensure_schema,0},
{mb_mnesia,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}
ancestors [mb_mnesia_sup,ns_server_cluster_sup,<0.59.0>]
messages []
links [<0.82.0>,<0.66.0>]
dictionary []
trap_exit true
status running
heap_size 610
stack_size 24
reductions 7994

CRASH REPORT <0.68.0> 2011-06-30 08:53:45
===============================================================================
Crashing process
initial_call {mb_mnesia,init,['Argument__1']}
pid <0.68.0>
registered_name []
error_info
{exit,{{badmatch,{timeout,['stats_archiver-Statistics-week',
'stats_archiver-default-week',
'stats_archiver-default-day',
'stats_archiver-default-month',
'stats_archiver-default-minute',
'stats_archiver-default-year',
'stats_archiver-default-hour']}},
[{mb_mnesia,ensure_schema,0},
{mb_mnesia,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}
ancestors [mb_mnesia_sup,ns_server_cluster_sup,<0.59.0>]
messages []
links [<0.82.0>,<0.66.0>]
dictionary []
trap_exit true
status running
heap_size 610
stack_size 24
reductions 7985

CRASH REPORT <0.68.0> 2011-06-30 09:51:41
===============================================================================
Crashing process
initial_call {mb_mnesia,init,['Argument__1']}
pid <0.68.0>
registered_name []
error_info
{exit,{{badmatch,{timeout,['stats_archiver-default-week',
'stats_archiver-default-month',
'stats_archiver-default-year']}},
[{mb_mnesia,ensure_schema,0},
{mb_mnesia,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}
ancestors [mb_mnesia_sup,ns_server_cluster_sup,<0.59.0>]
messages []
links [<0.82.0>,<0.66.0>]
dictionary []
trap_exit true
status running
heap_size 610
stack_size 24
reductions 7985

Thanks, the workaround does solve the issue. It's just a shame that it means deleting all archived statistics...
If there's anything I can do (such as providing additional information) to help resolve the issue, please let me know.
I just had a similar issue on one of our servers, which is running 1.7.2r-20-g6604356.
OS: Windows Server 2003.
The service appeared to be running, but Membase could not be accessed. Even rebooting the system did not help.
You can find the full log here:
http://www63.zippyshare.com/v/19241756/file.html
According to http://www.couchbase.com/issues/browse/MB-4006, it was supposed to be fixed in 1.7.1.
The workaround (deleting the mnesia folder) did work, but it is not acceptable when we deploy our solution to our enterprise clients.
Is the "Fix Version/s" field wrong? Was the fix not included in 1.7.2? Should this issue be reopened? Or am I facing a new issue?
The issue has not recurred since the workaround was applied.
However, we are still concerned that this may affect our customers.
Should I open my own thread? I thought it was the same issue, so I posted it here.
Thanks,
dc
Can anyone help with this ?
Thanks,
dc
The bug there was reopened on the 17th of April, so it seems to be under investigation.
Thanks @ingenthr!
I didn't realize it had been reopened.
thijs,
This is a known issue in 1.7.0. Can you please apply the suggested workaround on the nodes where Membase hangs or crashes during startup?
http://www.couchbase.org/issues/browse/MB-4006
The workaround is to remove all files under /opt/membase/var/lib/membase/mnesia/*.*
On the Windows platform, these files are located under program_files\membase\server\...
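Because the workaround throws away the archived statistics, it may be worth keeping a copy of the directory before clearing it. The following is a minimal Python sketch under stated assumptions, not an official procedure: it assumes the Membase service has already been stopped, that you pass in the mnesia directory for your platform (the full Windows path is not given above, so it is left to the caller), and that the backup location lies outside that directory. The function name reset_mnesia_dir is hypothetical, and whether restoring such a backup would bring the statistics back is untested.

```python
import shutil
import time
from pathlib import Path


def reset_mnesia_dir(mnesia_dir, backup_root):
    """Back up mnesia_dir, then delete everything inside it.

    Hypothetical helper. The directory itself is kept and only its
    contents are removed, mirroring "remove all files under .../mnesia/*.*".
    backup_root must not be inside mnesia_dir. Returns the backup path.
    """
    src = Path(mnesia_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"not a directory: {src}")
    # Timestamped copy, so the archived statistics are not lost outright.
    # copytree creates missing parent directories of the destination.
    backup = Path(backup_root) / f"mnesia-backup-{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.copytree(src, backup)
    # Clear the contents, descending into real subdirectories only.
    for entry in src.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return backup
```

On Linux this might be called as `reset_mnesia_dir("/opt/membase/var/lib/membase/mnesia", "/root/membase-backups")`, where the backup location is only an example.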