Memcached crash every 10 minutes

2 Nodes 10 buckets, on ubuntu 14.04, CB Community 4.5.0.

Memcached is crashing every 10 minutes or so with the same error:

Service 'memcached' exited with status 134. Restarting. Messages: 2017-03-27T04:40:15.586484-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d694bb1ed]
2017-03-27T04:40:15.586489-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d6951f310]
2017-03-27T04:40:15.586494-04:00 WARNING     /opt/couchbase/bin/../lib/libplatform.so.0.1.0() [0x7f0d6e5d284a]
2017-03-27T04:40:15.586500-04:00 WARNING     /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f0d6e3b7184]
2017-03-27T04:40:15.586524-04:00 WARNING     /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0d6cd0a37d]

Memcached log:

2017-03-27T04:40:15.586216-04:00 WARNING Breakpad caught crash in memcached. Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/37280e10-6d50-48a6-6c49bc08-530603f6.dmp before terminating.
2017-03-27T04:40:15.586248-04:00 WARNING Stack backtrace of crashed thread:
2017-03-27T04:40:15.586291-04:00 WARNING     /opt/couchbase/bin/memcached() [0x4252c4]
2017-03-27T04:40:15.586301-04:00 WARNING     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3d4) [0x45ac84]
2017-03-27T04:40:15.586307-04:00 WARNING     /opt/couchbase/bin/memcached() [0x45ae85]
2017-03-27T04:40:15.586312-04:00 WARNING     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0x97) [0x45afc7]
2017-03-27T04:40:15.586320-04:00 WARNING     /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f0d6e3bf330]
2017-03-27T04:40:15.586340-04:00 WARNING     /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f0d6cc46c37]
2017-03-27T04:40:15.586362-04:00 WARNING     /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f0d6cc4a028]
2017-03-27T04:40:15.586389-04:00 WARNING     /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155) [0x7f0d6d24b535]
2017-03-27T04:40:15.586398-04:00 WARNING     /opt/couchbase/bin/memcached() [0x427fcf]
2017-03-27T04:40:15.586419-04:00 WARNING     /usr/lib/x86_64-linux-gnu/libstdc++.so.6() [0x7f0d6d2496d6]
2017-03-27T04:40:15.586439-04:00 WARNING     /usr/lib/x86_64-linux-gnu/libstdc++.so.6() [0x7f0d6d249703]
2017-03-27T04:40:15.586458-04:00 WARNING     /usr/lib/x86_64-linux-gnu/libstdc++.so.6() [0x7f0d6d249922]
2017-03-27T04:40:15.586465-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d6957174d]
2017-03-27T04:40:15.586470-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d694bfa8c]
2017-03-27T04:40:15.586475-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d694bfe84]
2017-03-27T04:40:15.586480-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d694baf01]
2017-03-27T04:40:15.586484-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d694bb1ed]
2017-03-27T04:40:15.586489-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0d6951f310]
2017-03-27T04:40:15.586494-04:00 WARNING     /opt/couchbase/bin/../lib/libplatform.so.0.1.0() [0x7f0d6e5d284a]
2017-03-27T04:40:15.586500-04:00 WARNING     /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f0d6e3b7184]
2017-03-27T04:40:15.586524-04:00 WARNING     /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0d6cd0a37d]

babysitter.log for another incident

[ns_server:warn,2017-03-27T05:24:17.921-04:00,babysitter_of_ns_1@127.0.0.1:<0.135.0>:ns_port_server:log:215]Dropped 870 log lines from memcached
[error_logger:error,2017-03-27T05:24:46.812-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]** Generic server <0.135.0> terminating
** Last message in was {#Port<0.3586>,{exit_status,134}}
** When Server state == {state,#Port<0.3586>,memcached,
                               {["2017-03-27T05:24:17.755018-04:00 WARNING     /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f013465637d]",
                                 "2017-03-27T05:24:17.754989-04:00 WARNING     /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f0135d03184]",
                                 "2017-03-27T05:24:17.754980-04:00 WARNING     /opt/couchbase/bin/../lib/libplatform.so.0.1.0() [0x7f0135f1e84a]",
                                 "2017-03-27T05:24:17.754973-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0130f1f310]"],
                                ["2017-03-27T05:24:17.754966-04:00 WARNING     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0130ebb1ed]"]},
                               undefined,undefined,[],0}
** Reason for termination ==
** {abnormal,134}

[error_logger:error,2017-03-27T05:24:46.812-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_port_server:init/1
    pid: <0.135.0>
    registered_name: []
    exception exit: {abnormal,134}
      in function  gen_server:terminate/6 (gen_server.erl, line 744)
    ancestors: [<0.134.0>,<0.133.0>,ns_child_ports_sup,ns_babysitter_sup,
                  <0.55.0>]
    messages: [{'EXIT',#Port<0.3586>,normal}]
    links: [<0.134.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 121536
    stack_size: 27
    reductions: 2315186
  neighbours:

[ns_server:info,2017-03-27T05:24:46.812-04:00,babysitter_of_ns_1@127.0.0.1:<0.134.0>:supervisor_cushion:handle_info:58]Cushion managed supervisor for memcached failed:  {abnormal,134}
[error_logger:error,2017-03-27T05:24:46.812-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]** Generic server <0.134.0> terminating
** Last message in was {die,{abnormal,134}}
** When Server state == {state,memcached,5000,
                               {1490,606021,818667},
                               undefined,infinity}
** Reason for termination ==
** {abnormal,134}

[error_logger:error,2017-03-27T05:24:46.813-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: supervisor_cushion:init/1
    pid: <0.134.0>
    registered_name: []
    exception exit: {abnormal,134}
      in function  gen_server:terminate/6 (gen_server.erl, line 744)
    ancestors: [<0.133.0>,ns_child_ports_sup,ns_babysitter_sup,<0.55.0>]
    messages: []
    links: [<0.133.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 2000
  neighbours:

[ns_server:debug,2017-03-27T05:24:46.813-04:00,babysitter_of_ns_1@127.0.0.1:<0.137.0>:supervisor_cushion:init:39]starting ns_port_server with delay of 5000
[ns_server:debug,2017-03-27T05:24:46.813-04:00,babysitter_of_ns_1@127.0.0.1:<0.136.0>:restartable:start_child:98]Started child process <0.137.0>
  MFA: {supervisor_cushion,start_link,
                           [memcached,5000,infinity,ns_port_server,start_link,
                            [#Fun<ns_child_ports_sup.2.49698737>]]}
[error_logger:error,2017-03-27T05:24:46.813-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.133.0>
    registered_name: []
    exception exit: {abnormal,134}
      in function  restartable:loop/4 (src/restartable.erl, line 69)
    ancestors: [ns_child_ports_sup,ns_babysitter_sup,<0.55.0>]
    messages: []
    links: [<0.68.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 1472
  neighbours:

[error_logger:error,2017-03-27T05:24:46.813-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================SUPERVISOR REPORT=========================
     Supervisor: {local,ns_child_ports_sup}
     Context:    child_terminated
     Reason:     {abnormal,134}
     Offender:   [{pid,<0.133.0>},
                  {name,
                      {memcached,"/opt/couchbase/bin/memcached",
                          ["-C",
                           "/opt/couchbase/var/lib/couchbase/config/memcached.json"],
                          [{env,
                               [{"EVENT_NOSELECT","1"},
                                {"MEMCACHED_TOP_KEYS","5"},
                                {"ISASL_PWFILE",
                                 "/opt/couchbase/var/lib/couchbase/isasl.pw"}]},
                           use_stdio,stderr_to_stdout,exit_status,
                           port_server_dont_start,stream]}},
                  {mfargs,
                      {restartable,start_link,
                          [{supervisor_cushion,start_link,
                               [memcached,5000,infinity,ns_port_server,
                                start_link,
                                [#Fun<ns_child_ports_sup.2.49698737>]]},
                           86400000]}},
                  {restart_type,permanent},
                  {shutdown,infinity},
                  {child_type,worker}]


[error_logger:info,2017-03-27T05:24:46.814-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================PROGRESS REPORT=========================
          supervisor: {local,ns_child_ports_sup}
             started: [{pid,<0.136.0>},
                       {name,
                           {memcached,"/opt/couchbase/bin/memcached",
                               ["-C",
                                "/opt/couchbase/var/lib/couchbase/config/memcached.json"],
                               [{env,
                                    [{"EVENT_NOSELECT","1"},
                                     {"MEMCACHED_TOP_KEYS","5"},
                                     {"ISASL_PWFILE",
                                      "/opt/couchbase/var/lib/couchbase/isasl.pw"}]},
                                use_stdio,stderr_to_stdout,exit_status,
                                port_server_dont_start,stream]}},
                       {mfargs,
                           {restartable,start_link,
                               [{supervisor_cushion,start_link,
                                    [memcached,5000,infinity,ns_port_server,
                                     start_link,
                                     [#Fun<ns_child_ports_sup.2.49698737>]]},
                                86400000]}},
                       {restart_type,permanent},
                       {shutdown,infinity},
                       {child_type,worker}]

[ns_server:info,2017-03-27T05:25:03.176-04:00,babysitter_of_ns_1@127.0.0.1:<0.138.0>:ns_port_server:log:210]memcached<0.138.0>: 2017-03-27T05:25:02.975531-04:00 NOTICE Couchbase version 4.5.0-2601 starting

What other logs would be useful diagnosing this issue?

The referenced Minidump file from breakpad should show the backtrace of where the crash occurred. You can convert this into a core file using minidump-2-core - e.g.

/opt/couchbase/bin/minidump-2-core XXX.dmp > mcd_crash.core

If you then install the symbol files for your version (likely https://packages.couchbase.com.s3.amazonaws.com/releases/4.5.0/couchbase-server-community-dbg_4.5.0-ubuntu14.04_amd64.deb) and then run GDB you should be able to get a symbolised backtrace of the crash:

gdb -c mdc_crash.core /opt/couchbase/bin/memcached
thread apply all backtrace full

If you add that to this thread (along with the previous 100 or so lines from the memcached.log leading up to the crash) we can see if it’s a known issue or something new.

Thanks for the reply here are the logs:

backtrace: http://termbin.com/pp13
memcached.log: http://termbin.com/9otg

Thanks for the logs & backtrace. It looks like that couchstore (our on-disk format) has failed to decompress part of one of the files during DCP backfill - i.e. when we’re replicating - due to some form of file corruption.

Unfortunately it’s not clear which vBucket file has the issue - the variable holding the vbID has been optimised out of the backtrace:

#7  0x00007f0d6957174d in CouchKVStore::initScanContext (this=<optimized out>, cb=<error reading variable: Cannot access memory at address 0x7f0b544a2728>, 
    cl=<error reading variable: Cannot access memory at address 0x7f0b544a2768>, vbid=<optimized out>, startSeqno=1, options=ALL_ITEMS, 
    valOptions=VALUES_DECOMPRESSED) at /home/couchbase/jenkins/workspace/watson-unix/ep-engine/src/couch-kvstore/couch-kvstore.cc:1150
        err = <error reading variable: Cannot access memory at address 0x7f0c22a3aaf0>
        db = 0x7f0b89de8800
        errorCode = <optimized out>
        info = {filename = 0x7f0b540d43b0 <error: Cannot access memory at address 0x7f0b540d43b0>, last_sequence = 204612, doc_count = 22700, 
          deleted_count = 39, space_used = 15592853, file_size = 24137819, header_position = 24137728, purge_seq = 0}
        count = 0
        lh = <optimized out>
        backfillId = <optimized out>

Additionally, the errorCode is also optimised out but from examining the code the only likely error code is COUCHSTORE_ERROR_CORRUPT.

From the memcached.log, the last vBucket to be handled was vb:166, so at a guess maybe 167 is the problem child?

You can try using couch_dbdump on that file to see if it can be successfully read, if not you can try cbbackup on it to recover that data is still left.

I’ve also raised an issue to track improving this case, so we don’t crash if it occurs again - https://issues.couchbase.com/browse/MB-23550

Thanks. I can confirm it’s caused by a corrupt .couch file.

cbbackup returns

# cbbackup couchstore-files:///mnt/couchbase-dump/ /mnt/couchbase-bkup
error: could not read _local/vbstate from: /mnt/couchbase-dump/prod_cmo_decl_states/338.couch.483; exception: checksum fail