Hi all,
I have “inherited” a rather complex (and old) IT environment consisting of one database server (Oracle), three application servers, and two webservers. One app server runs Couchbase, one FAST ESP, and one Luke (Lucene). These three app servers feed two different websites on the two webservers.
I have an urgent issue where the Couchbase server crashes at exactly 2 AM. The service is then unavailable until next day at exactly 12 AM. The strange thing is that the crash window moves when we go from standard time to daylight saving time. During standard time the window is from 1 AM to 11 AM. After the “crash window”, I restart Couchbase and everything is blistering fast again.
I have searched everywhere for a clue on this issue but can’t find anything. I have also done some rather deep troubleshooting, but can’t find any explanation for this behavior.
In Couchbase there is one Webcache bucket and one Memcached bucket serving the two websites in different ways.
I’m in no way a Couchbase expert (rather a newbie), but I’m experienced in IT. As a bonus, the IT environment is mostly undocumented. I’m stumbling in the dark here.
Have any of you heard of a similar issue? Do you have any tips on further troubleshooting?
Many thanks in advance!
Cheers, Johan
What do you have from the crash? Messages? Logs? Core?
Thanks for replying! I have a lot of information, in fact, I have so much that I don’t know where to begin. The first second of the crash in the error file is 609 rows. Is it okay to post that much text here? Should I post it in Blockquote or attach a textfile? Thanks again.
Try posting as a file. If that doesn’t work, post inline.
Wouldn’t restarting manually restore service?
You said it crashes at 2am and is unavailable until 12am. So it’s only available the two hours from 12am to 2am? (or did you mean it is unavailable until 12pm (noon)?).
You might want to check for something scheduled to run at 2am. Or clients at 2am.
Or you could try to track down what changed around the time the 2am crashes started.
Sorry, I mean 12PM. I’m in Sweden so not used to AM/PM. The downtime is always ten hours exactly. It doesn’t help to restart the Couchbase service, and if it is running at 12PM I need to restart it for Couchbase to come alive. After the restart it only takes seconds for the websites to be blistering fast again.
I have uploaded a zip file with a snippet from the error file in var\lib\couchbase\logs containing the first second of the crash.
Thanks for your help,
Johan
Couchbase_errorlog_snippet_250623_0200.zip (3.0 KB)
If I search jira tickets for that error I find these.
The crash report looks somewhat like Jira
[stats:error,2025-06-23T2:00:02.317,ns_1@127.0.0.1:<0.12736.0>:stats_collector:handle_info:106]Exception in stats collector: {exit,
{{badmatch,{error,closed}},
{gen_server,call,
['ns_memcached-default',
{stats,<<>>},
180000]}},
[{gen_server,call,3},
{ns_memcached,do_call,3},
{stats_collector,grab_all_stats,1},
{stats_collector,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
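To see whether entries like this really cluster at the same hour every day, a rough sketch along these lines might help. It assumes every entry in the error file starts with the same `[module:level,timestamp,` header as the excerpt above; the function name `errors_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Header shape taken from the excerpt above: [module:level,timestamp,node:...
# The hour may be one or two digits because the excerpt shows "T2:00:02.317".
HEADER = re.compile(
    r"^\[(?P<module>\w+):(?P<level>\w+),"
    r"(?P<date>\d{4}-\d{2}-\d{2})T(?P<hour>\d{1,2}):(?P<minute>\d{2}):"
)

def errors_per_hour(lines):
    """Count error-level entries per (date, hour) bucket."""
    counts = Counter()
    for line in lines:
        m = HEADER.match(line)
        # Continuation lines of a stack trace do not match and are skipped.
        if m and m.group("level") == "error":
            counts[(m.group("date"), int(m.group("hour")))] += 1
    return counts

sample = [
    "[stats:error,2025-06-23T2:00:02.317,ns_1@127.0.0.1:<0.12736.0>:"
    "stats_collector:handle_info:106]Exception in stats collector: {exit,",
    " {{badmatch,{error,closed}},",
]
print(errors_per_hour(sample))
```

Passing an open file handle for the error log instead of `sample` would give one count per date and hour, which should make a daily pattern easy to spot.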
Thanks, I will look into that. Is there a scheduler in Couchbase that I haven’t been able to find?
This sounds suspect …
"The server has a process that will periodically scan every key in RAM and compile them into a log, named access.log as well as maintain a backup of this access log, named access.old. The server can use this backup file during warmup if the most recent access log has been corrupted during warmup or node failure. By default this process runs initially at 2:00 GMT and will run again in 24-hour time periods after that point. You can configure this process to run at a different initial time and at a different fixed interval."
You could try deleting the access.log and access.old in case the problem is that they are corrupt. You could also try setting alog_sleep_time to a large value like 525600 (a year, in minutes).
Thanks, that really looks interesting. It didn’t help to delete the access.log file. Now trying to see what I can find regarding the timing parameters. I can’t find that they are set at the moment, so the default should apply, which is 2AM UTC according to the manual. That is 4AM here in Sweden so it doesn’t line up with the crash at 2AM.
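The time-zone arithmetic can be double-checked with a small sketch like the one below. It uses fixed offsets for Sweden (CET = UTC+1 in winter, CEST = UTC+2 in summer) rather than a time-zone database, and the winter date is an arbitrary one picked for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for Sweden: CET (winter) is UTC+1, CEST (summer) is UTC+2.
CET = timezone(timedelta(hours=1), "CET")
CEST = timezone(timedelta(hours=2), "CEST")

def to_utc(year, month, day, hour, tz):
    """Convert a local wall-clock hour to UTC."""
    return datetime(year, month, day, hour, tzinfo=tz).astimezone(timezone.utc)

# Crash window as reported in the thread:
# 2 AM to 12 PM in summer, 1 AM to 11 AM in winter.
summer_start = to_utc(2025, 6, 23, 2, CEST)
summer_end = to_utc(2025, 6, 23, 12, CEST)
winter_start = to_utc(2025, 1, 15, 1, CET)   # arbitrary winter date
winter_end = to_utc(2025, 1, 15, 11, CET)

print(summer_start.time(), summer_end.time())
print(winter_start.time(), winter_end.time())

# Default access-log scan time quoted above (2:00 GMT), in Swedish local time.
scan_utc = datetime(2025, 6, 23, 2, tzinfo=timezone.utc)
print(scan_utc.astimezone(CEST).time(), scan_utc.astimezone(CET).time())
```

If the reported local times are accurate, both windows map to 00:00 to 10:00 UTC, which would suggest that whatever triggers the crash is scheduled in UTC rather than in local time. The default scan at 2:00 GMT lands at 4 AM in summer and 3 AM in winter, so it does not match either window.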
Maybe at one point someone set the start time to 4AM Sweden time?