I have “inherited” a rather complex (and old) IT environment consisting of one database server (Oracle), three application servers, and two webservers. One app server runs Couchbase, one FAST ESP, and one Luke (Lucene). These three app servers feed two different websites on the two webservers.
I have an urgent issue where the Couchbase server crashes at exactly 2 AM. The service is then unavailable until next day at exactly 12 AM. The strange thing is that the crash window moves when we go from standard time to daylight saving time. During standard time the window is from 1 AM to 11 AM. After the “crash window”, I restart Couchbase and everything is blistering fast again.
I have searched everywhere for a clue on this issue but can’t find anything. I have also done some fairly deep troubleshooting, but can’t find anything that explains this behavior.
In Couchbase there is one Webcache bucket and one Memcached bucket serving the two websites in different ways.
I’m in no way a Couchbase expert (rather newbie), but I’m experienced in IT. As a bonus, the IT environment is mostly undocumented. I stumble in the dark here.
Has any of you heard of a similar issue? Do you have any tips on further troubleshooting?
Thanks for replying! I have a lot of information, in fact, I have so much that I don’t know where to begin. The first second of the crash in the error file is 609 rows. Is it okay to post that much text here? Should I post it in Blockquote or attach a textfile? Thanks again.
Wouldn’t restarting manually restore service?
You said it crashes at 2am and is unavailable until 12am. So it’s only available the two hours from 12am to 2am? (or did you mean it is unavailable until 12pm (noon)?).
You might want to check for something scheduled to run at 2am, or for clients that connect at 2am.
Or see if you can track down what changed around the time the 2am crashes started.
Sorry, I meant 12PM (noon). I’m in Sweden, so I’m not used to AM/PM. The downtime is always exactly ten hours. Restarting the Couchbase service during that time doesn’t help, and even if it is running at 12PM I still need to restart it for Couchbase to come alive. After the restart it only takes seconds for the websites to be blistering fast again.
I have uploaded a zip file with a snippet from the error file in var\lib\couchbase\logs containing the first second of the crash.
"The server has a process that will periodically scan every key in RAM and compile them into a log, named access.log, as well as maintain a backup of this access log, named access.old. The server can use this backup file during warmup if the most recent access log has been corrupted during warmup or node failure. By default this process runs initially at 2:00 GMT and will run again in 24-hour time periods after that point. You can configure this process to run at a different initial time and at a different fixed interval."
You could try deleting access.log and access.old in case the problem is that they are corrupt. You could also try setting alog_sleep_time to a large value like 525600 (a year, in minutes).
Thanks, that really looks interesting. It didn’t help to delete the access.log file. Now trying to see what I can find regarding the timing parameters. I can’t find that they are set today, which should mean 2AM UTC according to the manual. That is 4AM here in Sweden so it doesn’t line up with the crash at 2AM.
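To make the time-zone arithmetic above concrete, here is a minimal Python sketch that converts a fixed UTC hour to Swedish local time in winter and in summer. The fixed offsets (CET = UTC+1, CEST = UTC+2), the helper name utc_to_local, and the idea of testing a 00:00 to 10:00 UTC window are assumptions made purely for illustration; nothing in the logs confirms them.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Swedish local time is CET (UTC+1) in winter and CEST (UTC+2) in summer.
CET = timezone(timedelta(hours=1), "CET")
CEST = timezone(timedelta(hours=2), "CEST")


def utc_to_local(utc_hour, tz):
    """Return the local wall-clock time (HH:MM) for a whole hour given in UTC."""
    # The calendar date is arbitrary; only the fixed offset matters here.
    moment = datetime(2020, 1, 15, utc_hour, 0, tzinfo=timezone.utc)
    return moment.astimezone(tz).strftime("%H:%M")


if __name__ == "__main__":
    checks = [
        ("default access scanner start (from the manual)", 2),
        ("hypothetical UTC-anchored window start", 0),
        ("hypothetical UTC-anchored window end", 10),
    ]
    for label, hour in checks:
        print(f"{label}: {hour:02d}:00 UTC -> "
              f"{utc_to_local(hour, CET)} CET / {utc_to_local(hour, CEST)} CEST")
```

Under these assumptions, 02:00 UTC is 03:00 in winter and 04:00 in summer, while 00:00 to 10:00 UTC comes out as 01:00 to 11:00 in winter and 02:00 to 12:00 in summer, which matches the reported crash window. That only suggests the trigger may be keyed to UTC rather than local time; it does not identify what the trigger is.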
Have you looked in the memcached log files to see if it printed anything?
If this is Linux, ensure that memcached can create a core file, and collect a call stack from that. To do so, change the startup script (couchbase-server.sh) to add something like ulimit -c unlimited, and wait for the next crash. Then run gdb /opt/couchbase/bin/memcached corefile and, inside gdb, execute thread apply all bt.