[MB-6806] memcached crashed in flush_buffer -> Rebalance exited with reason {mover_failed,{badmatch,{error,closed}}} Created: 03/Oct/12  Updated: 10/Jan/13  Resolved: 15/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0-beta
Fix Version/s: 2.0-beta-2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Andrei Baranouski Assignee: Jin Lim
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.92-diag.txt.gz     GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.93-diag.txt.gz     GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.94-diag.txt.gz     GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.95-diag.txt.gz     GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.96-diag.txt.gz     GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.97-diag.txt.gz     GZip Archive 7b00e100-a4c8-4b7b-ae5c-ddecb16f293a-10.3.121.98-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.92-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.93-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.94-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.95-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.96-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.97-diag.txt.gz     GZip Archive b92cc26f-59fc-4dc4-8ddf-d43bdfa58e32-10.3.121.98-diag.txt.gz    

 Description   
2.0.0-1787-rel
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/57/console
./testrunner -i /tmp/rebalance_regression.ini get-logs=True,disabled_consistent_view=False -t rebalancetests.IncrementalRebalanceInTests.test_load,replica=1,delete-ratio=0.6,expiry-ratio=0.2,keys-count=1000000

(disabled_consistent_view=False)


 gdb /opt/couchbase/bin/memcached core.memcached.28393
GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New Thread 28410]
[New Thread 28411]
[New Thread 28413]
[New Thread 28393]
[New Thread 28401]
[New Thread 28414]
[New Thread 28412]
[New Thread 28405]
[New Thread 28402]
[New Thread 28406]
[New Thread 28404]
[New Thread 28408]
[New Thread 28407]
[New Thread 28409]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /opt/couchbase/lib/memcached/libmemcached_utilities.so.0...done.
Loaded symbols for /opt/couchbase/lib/memcached/libmemcached_utilities.so.0
Reading symbols from /opt/couchbase/lib/libevent-2.0.so.5...done.
Loaded symbols for /opt/couchbase/lib/libevent-2.0.so.5
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
Reading symbols from /opt/couchbase/lib/libtcmalloc_minimal.so.4...done.
Loaded symbols for /opt/couchbase/lib/libtcmalloc_minimal.so.4
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
Reading symbols from /opt/couchbase/lib/memcached/stdin_term_handler.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/stdin_term_handler.so
Reading symbols from /opt/couchbase/lib/memcached/file_logger.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/file_logger.so
Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libz.so.1
Reading symbols from /opt/couchbase/lib/memcached/bucket_engine.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/bucket_engine.so
Reading symbols from /opt/couchbase/lib/memcached/ep.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/ep.so
Reading symbols from /opt/couchbase/lib/libcouchstore.so.1...done.
Loaded symbols for /opt/couchbase/lib/libcouchstore.so.1
Reading symbols from /opt/couchbase/lib/libsnappy.so.1...done.
Loaded symbols for /opt/couchbase/lib/libsnappy.so.1
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f2d34a116db in flush_buffer (handle=<value optimized out>, buf=0x7f2d33df2490, nbyte=8, offset=180315) at src/iobuffer.c:106
106 src/iobuffer.c: No such file or directory.
in src/iobuffer.c
(gdb) t a a bt

Thread 14 (Thread 28409):
#0 0x00007f2d3950d4ed in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3953e914 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f2d34c7c525 in updateStatsThread (arg=<value optimized out>) at src/memory_tracker.cc:31
#3 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 13 (Thread 28407):
#0 0x00007f2d39546633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3a314f36 in epoll_dispatch (base=0x5ca2a00, tv=<value optimized out>) at epoll.c:404
#2 0x00007f2d3a300394 in event_base_loop (base=0x5ca2a00, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x143d6e8) at daemon/thread.c:301
#4 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 12 (Thread 28408):
#0 0x00007f2d39546633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3a314f36 in epoll_dispatch (base=0x5ca2780, tv=<value optimized out>) at epoll.c:404
#2 0x00007f2d3a300394 in event_base_loop (base=0x5ca2780, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x143d7e0) at daemon/thread.c:301
#4 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 28404):
#0 0x00007f2d39546633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3a314f36 in epoll_dispatch (base=0x5ca2500, tv=<value optimized out>) at epoll.c:404
#2 0x00007f2d3a300394 in event_base_loop (base=0x5ca2500, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x143d400) at daemon/thread.c:301
#4 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 10 (Thread 28406):
#0 0x00007f2d39546633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3a314f36 in epoll_dispatch (base=0x5ca2c80, tv=<value optimized out>) at epoll.c:404
#2 0x00007f2d3a300394 in event_base_loop (base=0x5ca2c80, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x143d5f0) at daemon/thread.c:301
#4 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 9 (Thread 28402):
#0 0x00007f2d39800f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f2d3833b176 in logger_thead_main (arg=<value optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#3 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 8 (Thread 28405):
#0 0x00007f2d39546633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3a314f36 in epoll_dispatch (base=0x5ca2280, tv=<value optimized out>) at epoll.c:404
#2 0x00007f2d3a300394 in event_base_loop (base=0x5ca2280, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x143d4f8) at daemon/thread.c:301
#4 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 7 (Thread 28412):
#0 0x00007f2d39800f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f2d34c3cd68 in wait (this=0x14422d0, d=...) at src/syncobject.hh:58
#2 IdleTask::run (this=0x14422d0, d=...) at src/dispatcher.cc:336
#3 0x00007f2d34c3fd1a in Dispatcher::run (this=0x5cef6c0) at src/dispatcher.cc:173
#4 0x00007f2d34c4061b in launch_dispatcher_thread (arg=0x5cef714) at src/dispatcher.cc:28
#5 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()

Thread 6 (Thread 28414):
#0 0x00007f2d39800f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f2d34c5bd9f in wait (this=0x5cf8000) at src/syncobject.hh:58
#2 wait (this=0x5cf8000) at src/syncobject.hh:74
#3 wait (this=0x5cf8000) at src/tapconnmap.hh:169
#4 EventuallyPersistentEngine::notifyPendingConnections (this=0x5cf8000) at src/ep_engine.cc:3395
#5 0x00007f2d34c5be83 in EvpNotifyPendingConns (arg=0x5cf8000) at src/ep_engine.cc:1122
#6 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x0000000000000000 in ?? ()

Thread 5 (Thread 28401):
#0 0x00007f2d395372ed in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d394d2798 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f2d394d37be in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f2d394c78fa in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f2d394c67ca in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f2d38d3eb19 in fgets (arg=<value optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<value optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 4 (Thread 28393):
#0 0x00007f2d39546633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2d3a314f36 in epoll_dispatch (base=0x5ca2000, tv=<value optimized out>) at epoll.c:404
#2 0x00007f2d3a300394 in event_base_loop (base=0x5ca2000, flags=<value optimized out>) at event.c:1558
---Type <return> to continue, or q <return> to quit---
#3 0x000000000040c2d1 in main (argc=<value optimized out>, argv=<value optimized out>) at daemon/memcached.c:7912

Thread 3 (Thread 28413):
#0 0x00007f2d39800f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f2d34c3cd68 in wait (this=0x1442240, d=...) at src/syncobject.hh:58
#2 IdleTask::run (this=0x1442240, d=...) at src/dispatcher.cc:336
#3 0x00007f2d34c3fd1a in Dispatcher::run (this=0x5cef500) at src/dispatcher.cc:173
#4 0x00007f2d34c4061b in launch_dispatcher_thread (arg=0x5cef554) at src/dispatcher.cc:28
#5 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()

Thread 2 (Thread 28411):
#0 0x00007f2d39800f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f2d34c3cd68 in wait (this=0x1442360, d=...) at src/syncobject.hh:58
#2 IdleTask::run (this=0x1442360, d=...) at src/dispatcher.cc:336
#3 0x00007f2d34c3fd1a in Dispatcher::run (this=0x5cef880) at src/dispatcher.cc:173
#4 0x00007f2d34c4061b in launch_dispatcher_thread (arg=0x5cef8d4) at src/dispatcher.cc:28
#5 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()

Thread 1 (Thread 28410):
#0 0x00007f2d34a116db in flush_buffer (handle=<value optimized out>, buf=0x7f2d33df2490, nbyte=8, offset=180315) at src/iobuffer.c:106
#1 buffered_pwrite (handle=<value optimized out>, buf=0x7f2d33df2490, nbyte=8, offset=180315) at src/iobuffer.c:331
#2 0x00007f2d34a10434 in raw_write (db=0x5ca8620, buf=0x7f2d33df24e0, pos=<value optimized out>, disk_size=<value optimized out>)
    at src/couch_file_write.c:38
#3 db_write_buf (db=0x5ca8620, buf=0x7f2d33df24e0, pos=<value optimized out>, disk_size=<value optimized out>)
    at src/couch_file_write.c:98
#4 0x00007f2d34a10629 in db_write_buf_compressed (db=0x5ca8620, buf=0xf528048, pos=0x7f2d33df25c8, disk_size=0x7f2d33df25d8)
    at src/couch_file_write.c:136
#5 0x00007f2d34a0f9eb in write_doc (db=<value optimized out>, docs=0x1436420, infos=0x1436288, numdocs=<value optimized out>, options=1)
    at src/couch_save.c:45
#6 add_doc_to_update_list (db=<value optimized out>, docs=0x1436420, infos=0x1436288, numdocs=<value optimized out>, options=1)
    at src/couch_save.c:272
#7 couchstore_save_documents (db=<value optimized out>, docs=0x1436420, infos=0x1436288, numdocs=<value optimized out>, options=1)
    at src/couch_save.c:344
#8 0x00007f2d34cba107 in CouchKVStore::saveDocs (this=0x5d10000, vbid=212, rev=<value optimized out>, docs=<value optimized out>,
    docinfos=0x1436288, docCount=1) at src/couch-kvstore/couch-kvstore.cc:1539
#9 0x00007f2d34cba871 in CouchKVStore::commit2couchstore (this=0x5d10000) at src/couch-kvstore/couch-kvstore.cc:1474
#10 0x00007f2d34cbaba9 in queueItem (this=0x5d10000, itm=..., cb=<value optimized out>) at src/couch-kvstore/couch-kvstore.cc:1606
#11 CouchKVStore::set (this=0x5d10000, itm=..., cb=<value optimized out>) at src/couch-kvstore/couch-kvstore.cc:343


#12 0x00007f2d34c4bdd6 in EventuallyPersistentStore::flushOneDelOrSet (this=0x143f200, qi=..., rejectQueue=<value optimized out>)
    at src/ep.cc:2165
#13 0x00007f2d34c50ac4 in EventuallyPersistentStore::flushOne (this=0x143f200, q=0x143f4c0, rejectQueue=0xb6f7a40) at src/ep.cc:2205
#14 0x00007f2d34c50c38 in EventuallyPersistentStore::flushSome (this=0x143f200, q=0x143f4c0, rejectQueue=0xb6f7a40) at src/ep.cc:1854
#15 0x00007f2d34c769d9 in Flusher::doFlush (this=0x14425a0) at src/flusher.cc:245
#16 0x00007f2d34c77455 in Flusher::step (this=0xe2075f0, d=..., tid=...) at src/flusher.cc:158
#17 0x00007f2d34c3fd1a in Dispatcher::run (this=0x5ceea80) at src/dispatcher.cc:173
---Type <return> to continue, or q <return> to quit---
#18 0x00007f2d34c4061b in launch_dispatcher_thread (arg=0xe2075f0) at src/dispatcher.cc:28
#19 0x00007f2d397fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007f2d39545fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x0000000000000000 in ?? ()


[ns_server:info,2012-10-03T7:09:28.066,ns_1@10.3.121.94:<0.32168.25>:mc_connection:run_loop:214]mccouch connection was normally closed
[ns_server:debug,2012-10-03T7:09:28.067,ns_1@10.3.121.94:'ns_memcached-bucket-0':ns_memcached:handle_info:624]Got {'EXIT',<0.32170.25>,{badmatch,{error,closed}}}. Exiting.
[user:info,2012-10-03T7:09:28.074,ns_1@10.3.121.94:'ns_memcached-bucket-0':ns_memcached:terminate:680]Control connection to memcached on 'ns_1@10.3.121.94' disconnected: {badmatch,
                                                                     {error,
                                                                      closed}}
[error_logger:error,2012-10-03T7:09:28.074,ns_1@10.3.121.94:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.32170.25>
    registered_name: []
    exception error: no match of right hand side value {error,closed}
      in function mc_binary:quick_stats_recv/3
      in call from mc_binary:quick_stats_loop/5
      in call from mc_binary:quick_stats/5
      in call from ns_memcached:do_handle_call/3
      in call from ns_memcached:worker_loop/3
    ancestors: ['ns_memcached-bucket-0','single_bucket_sup-bucket-0',
                  <0.32153.25>]
    messages: []
    links: [<0.32167.25>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 28657
    stack_size: 24
    reductions: 2877355
  neighbours:

[user:info,2012-10-03T7:09:28.074,ns_1@10.3.121.94:ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node 'ns_1@10.3.121.94' exited with status 139. Restarting. Messages: Wed Oct 3 07:09:07.390733 PDT 3: TAP (Consumer) eq_tapq:anon_943 - disconnected
Wed Oct 3 07:09:07.498627 PDT 3: Schedule cleanup of "eq_tapq:anon_942"
Wed Oct 3 07:09:07.498687 PDT 3: Schedule cleanup of "eq_tapq:anon_943"
Wed Oct 3 07:09:07.498682 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:07.804338 PDT 3: TAP (Consumer) eq_tapq:anon_944 - disconnected
Wed Oct 3 07:09:07.925300 PDT 3: TAP (Consumer) eq_tapq:anon_945 - disconnected
Wed Oct 3 07:09:08.020906 PDT 3: Schedule cleanup of "eq_tapq:anon_944"
Wed Oct 3 07:09:08.020986 PDT 3: Schedule cleanup of "eq_tapq:anon_945"
Wed Oct 3 07:09:08.020980 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:08.404868 PDT 3: TAP (Consumer) eq_tapq:anon_946 - disconnected
Wed Oct 3 07:09:08.561115 PDT 3: TAP (Consumer) eq_tapq:anon_947 - disconnected
Wed Oct 3 07:09:08.655894 PDT 3: Schedule cleanup of "eq_tapq:anon_946"
Wed Oct 3 07:09:08.655922 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:08.655968 PDT 3: Schedule cleanup of "eq_tapq:anon_947"
Wed Oct 3 07:09:09.018520 PDT 3: TAP (Consumer) eq_tapq:anon_948 - disconnected
Wed Oct 3 07:09:09.180208 PDT 3: TAP (Consumer) eq_tapq:anon_949 - disconnected
Wed Oct 3 07:09:09.310722 PDT 3: Schedule cleanup of "eq_tapq:anon_948"
Wed Oct 3 07:09:09.310749 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:09.310803 PDT 3: Schedule cleanup of "eq_tapq:anon_949"
Wed Oct 3 07:09:10.667048 PDT 3: TAP (Consumer) eq_tapq:anon_950 - disconnected
Wed Oct 3 07:09:10.802652 PDT 3: TAP (Consumer) eq_tapq:anon_951 - disconnected
Wed Oct 3 07:09:10.893904 PDT 3: Schedule cleanup of "eq_tapq:anon_950"
Wed Oct 3 07:09:10.893977 PDT 3: Schedule cleanup of "eq_tapq:anon_951"
Wed Oct 3 07:09:10.893940 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:12.259851 PDT 3: TAP (Consumer) eq_tapq:anon_952 - disconnected
Wed Oct 3 07:09:12.376843 PDT 3: TAP (Consumer) eq_tapq:anon_953 - disconnected
Wed Oct 3 07:09:12.485074 PDT 3: Schedule cleanup of "eq_tapq:anon_952"
Wed Oct 3 07:09:12.485107 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:12.485148 PDT 3: Schedule cleanup of "eq_tapq:anon_953"
Wed Oct 3 07:09:12.849159 PDT 3: TAP (Consumer) eq_tapq:anon_954 - disconnected
Wed Oct 3 07:09:13.002880 PDT 3: TAP (Consumer) eq_tapq:anon_955 - disconnected
Wed Oct 3 07:09:13.109536 PDT 3: Schedule cleanup of "eq_tapq:anon_954"
Wed Oct 3 07:09:13.109576 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:13.109626 PDT 3: Schedule cleanup of "eq_tapq:anon_955"
Wed Oct 3 07:09:13.529732 PDT 3: TAP (Consumer) eq_tapq:anon_956 - disconnected
Wed Oct 3 07:09:13.714518 PDT 3: TAP (Consumer) eq_tapq:anon_957 - disconnected
Wed Oct 3 07:09:13.815423 PDT 3: Schedule cleanup of "eq_tapq:anon_956"
Wed Oct 3 07:09:13.815489 PDT 3: Schedule cleanup of "eq_tapq:anon_957"
Wed Oct 3 07:09:13.815685 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:14.200683 PDT 3: TAP (Consumer) eq_tapq:anon_958 - disconnected
Wed Oct 3 07:09:14.336430 PDT 3: TAP (Consumer) eq_tapq:anon_959 - disconnected
Wed Oct 3 07:09:14.425384 PDT 3: Schedule cleanup of "eq_tapq:anon_958"
Wed Oct 3 07:09:14.425456 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:14.425466 PDT 3: Schedule cleanup of "eq_tapq:anon_959"
Wed Oct 3 07:09:15.601665 PDT 3: TAP (Consumer) eq_tapq:anon_960 - disconnected
Wed Oct 3 07:09:15.720572 PDT 3: TAP (Consumer) eq_tapq:anon_961 - disconnected
Wed Oct 3 07:09:15.816248 PDT 3: Schedule cleanup of "eq_tapq:anon_960"
Wed Oct 3 07:09:15.816297 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:15.816359 PDT 3: Schedule cleanup of "eq_tapq:anon_961"
Wed Oct 3 07:09:16.225778 PDT 3: TAP (Consumer) eq_tapq:anon_962 - disconnected
Wed Oct 3 07:09:16.362276 PDT 3: TAP (Consumer) eq_tapq:anon_963 - disconnected
Wed Oct 3 07:09:16.448418 PDT 3: Schedule cleanup of "eq_tapq:anon_962"
Wed Oct 3 07:09:16.448512 PDT 3: Schedule cleanup of "eq_tapq:anon_963"
Wed Oct 3 07:09:16.448533 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:16.752436 PDT 3: TAP (Consumer) eq_tapq:anon_964 - disconnected
Wed Oct 3 07:09:16.908300 PDT 3: TAP (Consumer) eq_tapq:anon_965 - disconnected
Wed Oct 3 07:09:17.076167 PDT 3: Schedule cleanup of "eq_tapq:anon_964"
Wed Oct 3 07:09:17.076241 PDT 3: Schedule cleanup of "eq_tapq:anon_965"
Wed Oct 3 07:09:17.076465 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:17.422470 PDT 3: TAP (Consumer) eq_tapq:anon_966 - disconnected
Wed Oct 3 07:09:17.558228 PDT 3: TAP (Consumer) eq_tapq:anon_967 - disconnected
Wed Oct 3 07:09:17.660250 PDT 3: Schedule cleanup of "eq_tapq:anon_966"
Wed Oct 3 07:09:17.660290 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:17.660306 PDT 3: Schedule cleanup of "eq_tapq:anon_967"
Wed Oct 3 07:09:19.141698 PDT 3: TAP (Consumer) eq_tapq:anon_968 - disconnected
Wed Oct 3 07:09:19.264492 PDT 3: TAP (Consumer) eq_tapq:anon_969 - disconnected
Wed Oct 3 07:09:19.364848 PDT 3: Schedule cleanup of "eq_tapq:anon_968"
Wed Oct 3 07:09:19.364931 PDT 3: Schedule cleanup of "eq_tapq:anon_969"
Wed Oct 3 07:09:19.365354 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:19.733874 PDT 3: TAP (Consumer) eq_tapq:anon_970 - disconnected
Wed Oct 3 07:09:19.842574 PDT 3: TAP (Consumer) eq_tapq:anon_971 - disconnected
Wed Oct 3 07:09:19.941458 PDT 3: Schedule cleanup of "eq_tapq:anon_970"
Wed Oct 3 07:09:19.941843 PDT 3: Schedule cleanup of "eq_tapq:anon_971"
Wed Oct 3 07:09:19.941791 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:21.354628 PDT 3: TAP (Consumer) eq_tapq:anon_972 - disconnected
Wed Oct 3 07:09:21.480109 PDT 3: TAP (Consumer) eq_tapq:anon_973 - disconnected
Wed Oct 3 07:09:21.576311 PDT 3: Schedule cleanup of "eq_tapq:anon_972"
Wed Oct 3 07:09:21.576402 PDT 3: Schedule cleanup of "eq_tapq:anon_973"
Wed Oct 3 07:09:21.576465 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:22.841716 PDT 3: TAP (Consumer) eq_tapq:anon_974 - disconnected
Wed Oct 3 07:09:22.952241 PDT 3: TAP (Consumer) eq_tapq:anon_975 - disconnected
Wed Oct 3 07:09:23.040710 PDT 3: Schedule cleanup of "eq_tapq:anon_974"
Wed Oct 3 07:09:23.040738 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:23.040880 PDT 3: Schedule cleanup of "eq_tapq:anon_975"
Wed Oct 3 07:09:23.308185 PDT 3: TAP (Consumer) eq_tapq:anon_976 - disconnected
Wed Oct 3 07:09:23.481399 PDT 3: TAP (Consumer) eq_tapq:anon_977 - disconnected
Wed Oct 3 07:09:23.566951 PDT 3: Schedule cleanup of "eq_tapq:anon_976"
Wed Oct 3 07:09:23.567046 PDT 3: Schedule cleanup of "eq_tapq:anon_977"
Wed Oct 3 07:09:23.567184 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:24.918727 PDT 3: TAP (Consumer) eq_tapq:anon_978 - disconnected
Wed Oct 3 07:09:25.054151 PDT 3: TAP (Consumer) eq_tapq:anon_979 - disconnected
Wed Oct 3 07:09:25.142569 PDT 3: Schedule cleanup of "eq_tapq:anon_978"
Wed Oct 3 07:09:25.142638 PDT 3: Schedule cleanup of "eq_tapq:anon_979"
Wed Oct 3 07:09:25.142835 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:25.503745 PDT 3: TAP (Consumer) eq_tapq:anon_980 - disconnected
Wed Oct 3 07:09:25.631364 PDT 3: TAP (Consumer) eq_tapq:anon_981 - disconnected
Wed Oct 3 07:09:25.724000 PDT 3: Schedule cleanup of "eq_tapq:anon_980"
Wed Oct 3 07:09:25.724036 PDT 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.92 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Wed Oct 3 07:09:25.724083 PDT 3: Schedule cleanup of "eq_tapq:anon_981"
Wed Oct 3 07:09:27.022598 PDT 3: TAP (Consumer) eq_tapq:anon_982 - disconnected
[stats:error,2012-10-03T7:09:28.077,ns_1@10.3.121.94:<0.32178.25>:stats_collector:handle_info:106]Exception in stats collector: {exit,
                                  {{badmatch,{error,closed}},
                                   {gen_server,call,
                                       ['ns_memcached-bucket-0',
                                        {stats,<<>>},
                                        60000]}},
                                  [{gen_server,call,3},
                                   {ns_memcached,do_call,3},
                                   {stats_collector,grab_all_stats,1},
                                   {stats_collector,handle_info,2},
                                   {gen_server,handle_msg,5},
                                   {proc_lib,init_p_do_apply,3}]}

[error_logger:error,2012-10-03T7:09:28.079,ns_1@10.3.121.94:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.32169.25>
    registered_name: []
    exception error: no match of right hand side value {error,closed}
      in function mc_binary:quick_stats_recv/3
      in call from mc_binary:quick_stats_loop/5
      in call from mc_binary:quick_stats/5
      in call from ns_memcached:do_handle_call/3
      in call from ns_memcached:worker_loop/3
    ancestors: ['ns_memcached-bucket-0','single_bucket_sup-bucket-0',
                  <0.32153.25>]
    messages: []
    links: [<0.32167.25>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 196418
    stack_size: 24
    reductions: 1837035
  neighbours:

[ns_server:info,2012-10-03T7:09:28.081,ns_1@10.3.121.94:<0.32103.25>:supervisor_cushion:handle_info:62]Cushion managed supervisor for memcached failed: {abnormal,139}


Control connection to memcached on 'ns_1@10.3.121.94' disconnected: {badmatch,
{error,
closed}}

 Comments   
Comment by Farshid Ghods (Inactive) [ 03/Oct/12 ]
memcached crashes shuld be assigned to Chiyoung ( couchbase bucket team )
Comment by Aleksey Kondratenko [ 03/Oct/12 ]
memcached crash belongs to ep-engine
Comment by Jin Lim [ 03/Oct/12 ]
Andrei, do this test also have a scenario where running DB in non-json mode?
Comment by Jin Lim [ 03/Oct/12 ]
This is a very peculiar bug where the segv looks like being caused by invalid couchstore_file_ops for pwrite that should have been initialized during the ep engine initialization period for couchkvstore. Is it possible to rerun the test while attaching the memcached process to gdb and let gdb capture the crash again? If the test suite restarts memcached several times during the test scenario, then I guess we cannot do so. Let me know. Thanks.
Comment by Jin Lim [ 03/Oct/12 ]
Pls see my comment above.
Comment by Andrei Baranouski [ 04/Oct/12 ]
Jin, unfortunately I can't get crahes when I attach memcashed process to gdb.
after installation and when test was started i perform command 'gdb --quiet --pid=`pgrep memcached`' on each node, after some time the nodes become unavailable.

Only after I quite gdb node will work again


[ns_server:debug,2012-10-04T0:56:22.109,ns_1@127.0.0.1:<0.7025.0>:compaction_daemon:file_needs_compaction:841]Estimated file size for `bucket-0` is less than min_file_size 134348800; skipping
[ns_server:debug,2012-10-04T0:56:22.109,ns_1@127.0.0.1:compaction_daemon:compaction_daemon:handle_info:309]Finished compaction iteration.
[ns_server:debug,2012-10-04T0:56:22.109,ns_1@127.0.0.1:compaction_daemon:compaction_daemon:schedule_next_compaction:1204]Finished compaction too soon. Next run will be in 29s
[ns_server:warn,2012-10-04T1:32:14.976,ns_1@127.0.0.1:mb_master:mb_master:handle_info:218]Skipped 24 heartbeats

[ns_server:warn,2012-10-04T1:32:14.976,ns_1@127.0.0.1:menelaus_web_alerts_srv:menelaus_web_alerts_srv:handle_info:172]Eaten 16 previously unconsumed check_alerts


Jin, let me know if I do it right and what else I can help you
Comment by Andrei Baranouski [ 04/Oct/12 ]
tried the same test, node 94 was crashed again
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/62/console

Comment by Farshid Ghods (Inactive) [ 04/Oct/12 ]
Andrei,

did you get the same backtrace ?
Comment by Farshid Ghods (Inactive) [ 04/Oct/12 ]
Jin,

can you please run the test case mentioned by Andrei in the bug descriptiion.
Andrei, how many nodes does Jin need to rn this test ?
Comment by Jin Lim [ 04/Oct/12 ]
I will run the same test on our local servers as soon as I get more setting info from Andrei (it maybe help if he send me an email or attach the ini file here).
Also in the mean time we will be building a toy build that Andrei can give it a try as well. Thanks for your quick assistance, Andrei and Farshid.
Comment by Farshid Ghods (Inactive) [ 04/Oct/12 ]
test case :
./testrunner -i /tmp/rebalance_regression.ini get-logs=True,disabled_consistent_view=False -t rebalancetests.IncrementalRebalanceInTests.test_load,replica=1,delete-ratio=0.6,expiry-ratio=0.2,keys-count=1000000

do you have few servers that we can use for testing ?
Comment by Jin Lim [ 04/Oct/12 ]
yes ep engine has a few centos boxes. how many nodes does this test require? also the toy build couchbase-server-community_toy-couchstore-x86_64_2.0.0-10042012-toy.rpm is ready for testing.
Comment by Andrei Baranouski [ 05/Oct/12 ]
This test actually uses ubuntu 64 vms.(confusion in the settings/ini file)

So I was able to check toy build only on centos build.
rebalance was failed 2 times for couchbase-server-community_toy-couchstore-x86_64_2.0.0-10042012-toy.rpm, but without crashes

http://qa.hq.northscale.net/job/centos-64-2.0-new-rebalance/104/console
http://qa.hq.northscale.net/job/centos-64-2.0-new-rebalance/96/console

logs you can find by link http://qa.hq.northscale.net/job/centos-64-2.0-new-rebalance/104/artifact/*zip*/archive.zip
but unfortunately they don't contains logs when rebalance was failed.
I think it was because in the logs there are the large number of errors:

[ns_server:debug,2012-10-04T8:38:31.736,ns_1@10.3.121.112:<0.3259.0>:janitor_agent:new_style_query_vbucket_states_loop_next_step:121]Waiting for "sasl" on 'ns_1@10.3.121.113'
[ns_server:debug,2012-10-04T8:38:31.736,ns_1@10.3.121.112:<0.3260.0>:janitor_agent:new_style_query_vbucket_states_loop_next_step:121]Waiting for "sasl" on 'ns_1@10.3.121.114'
[ns_server:debug,2012-10-04T8:38:31.826,ns_1@10.3.121.112:'couch_stats_reader-mem_sasl':couch_stats_reader:vbuckets_aggregation_loop:117]Failed to open vbucket: 0 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2012-10-04T8:38:31.827,ns_1@10.3.121.112:'couch_stats_reader-mem_sasl':couch_stats_reader:vbuckets_aggregation_loop:117]Failed to open vbucket: 1 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2012-10-04T8:38:31.827,ns_1@10.3.121.112:'couch_stats_reader-mem_sasl':couch_stats_reader:vbuckets_aggregation_loop:117]Failed to open vbucket: 2 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2012-10-04T8:38:31.828,ns_1@10.3.121.112:'couch_stats_reader-mem_sasl':couch_stats_reader:vbuckets_aggregation_loop:117]Failed to open vbucket: 3 ({not_found,no_db_file}). Ignoring
[ns_server:debug,2012-10-04T8:38:31.829,ns_1@10.3.121.112:'couch_stats_reader-mem_sasl':couch_stats_reader:vbuckets_aggregation_loop:117]Failed to open vbucket: 4 ({not_found,no_db_file}). Ignoring


It would be good to check toy build against Ubuntu with the same configuration that bug was filled

Jin, I'll send email with ini files that you can use for execution

apparently this test has no special requirements on the number of vms, it rebalances incrementally
probably, that the four vms would be enough to test

Also, I started the job with the same test( ubuntu vms) on new build 2.0.0-1805-rel, it's still running successfully http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/67/consoleFull




Comment by Jin Lim [ 05/Oct/12 ]
Thanks much. I was able to run the same test on CentOS (5 nodes) and ran into the same issue. Currently debugging the issue with a possible fix.
Comment by Jin Lim [ 05/Oct/12 ]
The fix is pending to merge, http://review.couchbase.org/#/c/21396/
Comment by Thuan Nguyen [ 08/Oct/12 ]
Integrated in github-ep-engine-2-0 #438 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/438/])
    MB-6806 release returned document that is being marked deleted (Revision 3d545832ed84650e480855cf3abae6fef9fccf9d)

     Result = SUCCESS
Jin Lim :
Files :
* src/couch-kvstore/couch-kvstore.cc
Comment by Andrei Baranouski [ 10/Oct/12 ]
Jin, It was reproduced again on build 1820:

http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/69/consoleFull


gdb /opt/couchbase/bin/memcached core.memcached.30197
GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/couchbase/bin/memcached...done.
[New Thread 30214]
[New Thread 30216]
[New Thread 30217]
[New Thread 30212]
[New Thread 30218]
[New Thread 30197]
[New Thread 30205]
[New Thread 30206]
[New Thread 30208]
[New Thread 30213]
[New Thread 30210]
[New Thread 30209]
[New Thread 30211]
[New Thread 30215]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /opt/couchbase/lib/memcached/libmemcached_utilities.so.0...done.
Loaded symbols for /opt/couchbase/lib/memcached/libmemcached_utilities.so.0
Reading symbols from /opt/couchbase/lib/libevent-2.0.so.5...done.
Loaded symbols for /opt/couchbase/lib/libevent-2.0.so.5
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
Reading symbols from /opt/couchbase/lib/libtcmalloc_minimal.so.4...done.
Loaded symbols for /opt/couchbase/lib/libtcmalloc_minimal.so.4
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
Reading symbols from /opt/couchbase/lib/memcached/stdin_term_handler.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/stdin_term_handler.so
Reading symbols from /opt/couchbase/lib/memcached/file_logger.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/file_logger.so
Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libz.so.1
Reading symbols from /opt/couchbase/lib/memcached/bucket_engine.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/bucket_engine.so
Reading symbols from /opt/couchbase/lib/memcached/ep.so...done.
Loaded symbols for /opt/couchbase/lib/memcached/ep.so
Reading symbols from /opt/couchbase/lib/libcouchstore.so.1...done.
Loaded symbols for /opt/couchbase/lib/libcouchstore.so.1
Reading symbols from /opt/couchbase/lib/libsnappy.so.1...done.
Loaded symbols for /opt/couchbase/lib/libsnappy.so.1
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f53fd51383b in flush_buffer (handle=0xd3a6750, buf=<value optimized out>, nbyte=8, offset=214483) at src/iobuffer.c:106
106 src/iobuffer.c: No such file or directory.
in src/iobuffer.c
(gdb) t a a bt

Thread 14 (Thread 30215):
#0 0x00007f5402302f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f53fd73ed68 in wait (this=0x8f4360, d=...) at src/syncobject.hh:58
#2 IdleTask::run (this=0x8f4360, d=...) at src/dispatcher.cc:336
#3 0x00007f53fd741d1a in Dispatcher::run (this=0x51a5880) at src/dispatcher.cc:173
#4 0x00007f53fd74261b in launch_dispatcher_thread (arg=0x51a58d4) at src/dispatcher.cc:28
#5 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()

Thread 13 (Thread 30211):
#0 0x00007f5402048633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402e16f36 in epoll_dispatch (base=0x5158a00, tv=<value optimized out>) at epoll.c:404
#2 0x00007f5402e02394 in event_base_loop (base=0x5158a00, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x8ef6e8) at daemon/thread.c:301
#4 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 12 (Thread 30209):
#0 0x00007f5402048633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402e16f36 in epoll_dispatch (base=0x5158280, tv=<value optimized out>) at epoll.c:404
#2 0x00007f5402e02394 in event_base_loop (base=0x5158280, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x8ef4f8) at daemon/thread.c:301
#4 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 11 (Thread 30210):
#0 0x00007f5402048633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402e16f36 in epoll_dispatch (base=0x5158c80, tv=<value optimized out>) at epoll.c:404
#2 0x00007f5402e02394 in event_base_loop (base=0x5158c80, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x8ef5f0) at daemon/thread.c:301
#4 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 10 (Thread 30213):
#0 0x00007f540200f4ed in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402040914 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f53fd77e525 in updateStatsThread (arg=<value optimized out>) at src/memory_tracker.cc:31
#3 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 9 (Thread 30208):
#0 0x00007f5402048633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402e16f36 in epoll_dispatch (base=0x5158500, tv=<value optimized out>) at epoll.c:404
#2 0x00007f5402e02394 in event_base_loop (base=0x5158500, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x8ef400) at daemon/thread.c:301
#4 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#5 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 8 (Thread 30206):
#0 0x00007f5402302f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f5400e3d176 in logger_thead_main (arg=<value optimized out>) at extensions/loggers/file_logger.c:368
#2 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()

Thread 7 (Thread 30205):
#0 0x00007f54020392ed in read () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5401fd4798 in _IO_file_underflow () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f5401fd57be in _IO_default_uflow () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f5401fc98fa in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f5401fc87ca in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f5401840b19 in fgets (arg=<value optimized out>) at /usr/include/bits/stdio2.h:255
#6 check_stdin_thread (arg=<value optimized out>) at extensions/daemon/stdin_check.c:37
#7 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 6 (Thread 30197):
#0 0x00007f5402048633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402e16f36 in epoll_dispatch (base=0x5158000, tv=<value optimized out>) at epoll.c:404
#2 0x00007f5402e02394 in event_base_loop (base=0x5158000, flags=<value optimized out>) at event.c:1558
#3 0x000000000040c2d1 in main (argc=<value optimized out>, argv=<value optimized out>) at daemon/memcached.c:7912

Thread 5 (Thread 30218):
#0 0x00007f5402302f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0





#1 0x00007f53fd75dd9f in wait (this=0x51ae000) at src/syncobject.hh:58
#2 wait (this=0x51ae000) at src/syncobject.hh:74
#3 wait (this=0x51ae000) at src/tapconnmap.hh:169
#4 EventuallyPersistentEngine::notifyPendingConnections (this=0x51ae000) at src/ep_engine.cc:3395
#5 0x00007f53fd75de83 in EvpNotifyPendingConns (arg=0x51ae000) at src/ep_engine.cc:1122
#6 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x0000000000000000 in ?? ()

Thread 4 (Thread 30212):
#0 0x00007f5402048633 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5402e16f36 in epoll_dispatch (base=0x5158780, tv=<value optimized out>) at epoll.c:404
#2 0x00007f5402e02394 in event_base_loop (base=0x5158780, flags=<value optimized out>) at event.c:1558
#3 0x0000000000414b94 in worker_libevent (arg=0x8ef7e0) at daemon/thread.c:301
#4 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 3 (Thread 30217):
#0 0x00007f5402302f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f53fd73ed68 in wait (this=0x8f4240, d=...) at src/syncobject.hh:58
#2 IdleTask::run (this=0x8f4240, d=...) at src/dispatcher.cc:336
---Type <return> to continue, or q <return> to quit---
#3 0x00007f53fd741d1a in Dispatcher::run (this=0x51a5500) at src/dispatcher.cc:173
#4 0x00007f53fd74261b in launch_dispatcher_thread (arg=0x51a5554) at src/dispatcher.cc:28
#5 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()

Thread 2 (Thread 30216):
#0 0x00007f5402302f2b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f53fd73ed68 in wait (this=0x8f42d0, d=...) at src/syncobject.hh:58
#2 IdleTask::run (this=0x8f42d0, d=...) at src/dispatcher.cc:336
#3 0x00007f53fd741d1a in Dispatcher::run (this=0x51a56c0) at src/dispatcher.cc:173
#4 0x00007f53fd74261b in launch_dispatcher_thread (arg=0x51a5714) at src/dispatcher.cc:28
#5 0x00007f54022fdd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f5402047fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()

Thread 1 (Thread 30214):
#0 0x00007f53fd51383b in flush_buffer (handle=0xd3a6750, buf=<value optimized out>, nbyte=8, offset=214483) at src/iobuffer.c:106
#1 buffered_pread (handle=0xd3a6750, buf=<value optimized out>, nbyte=8, offset=214483) at src/iobuffer.c:271
#2 0x00007f53fd511efc in read_skipping_prefixes (db=0x515f1f0, pos=214483, ret_ptr=<value optimized out>, header=0) at src/couch_file_read.c:27
#3 pread_bin_internal (db=0x515f1f0, pos=214483, ret_ptr=<value optimized out>, header=0) at src/couch_file_read.c:53
#4 0x00007f53fd512107 in pread_compressed (db=0xd3a6750, pos=4294967295, ret_ptr=0x7f53fc8f42f8) at src/couch_file_read.c:92
#5 0x00007f53fd50dcda in modify_node (rq=0x7f53fc8f43d0, nptr=<value optimized out>, start=0, end=94, dst=0xcd30010) at src/btree_modify.c:321
#6 0x00007f53fd50e8b7 in modify_btree (rq=0x7f53fc8f43d0, root=0x9cf9540, errcode=0x7f53fc8f44dc) at src/btree_modify.c:620
#7 0x00007f53fd511555 in update_indexes (db=<value optimized out>, seqs=0x51615a0, seqvals=0x51615c0, ids=<value optimized out>, idvals=0x51615d0, numdocs=1) at src/couch_save.c:198
#8 0x00007f53fd511c41 in couchstore_save_documents (db=<value optimized out>, docs=0x8e8360, infos=0x8e82f8, numdocs=<value optimized out>, options=<value optimized out>)
    at src/couch_save.c:354


Comment by Andrei Baranouski [ 10/Oct/12 ]
Jin, if it's needed I can provide ini file to reproduce

http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1820-rel.deb.manifest.xml
Comment by Jin Lim [ 10/Oct/12 ]
Let's first get some input from the Couchstore team.

Aaron - can you please take a quick look at the above stack dump and see if you can shed more lights on how the iobuffer.c#106 could get caught on the seg fault?

Thanks,
Jin
Comment by Jin Lim [ 10/Oct/12 ]
Let's first get some input from the Couchstore team.

Aaron - can you please take a quick look at the above stack dump and see if you can shed more lights on how the iobuffer.c#106 could get caught on the seg fault?

Thanks,
Jin
Comment by Aaron Miller (Inactive) [ 11/Oct/12 ]
This is the line on which Couchstore's I/O write buffer layer calls the underlying file system layer. It's blowing up on a pointer dereference, which when running inside ep-engine will be trying to dereference ep-engine's filesystem stats collector. It appears as if that object (or at least the pointer that was given to Couchstore originally), became invalid.
Comment by Aaron Miller (Inactive) [ 11/Oct/12 ]
If you can easily reproduce this, you can test if this is the case by swapping the line here: https://github.com/couchbase/ep-engine/blob/master/src/couch-kvstore/couch-kvstore.cc#L1186

with:
 couch_file_ops* ops = couchstore_get_default_file_ops();

This will disable K/V timings stats, but if it makes the crash go away, that means the issue is with the management of the statCollectingFileOps struct inside CouchKVStore.
Comment by Jin Lim [ 11/Oct/12 ]
2-nodes cluster on CentOS + the latest build didn't run into the same issue.
While the development team continuing debugging, QE (Andrei) will rerun the same test on their 6-nodes cluster against the latest build as well.
Comment by Jin Lim [ 11/Oct/12 ]
Thanks much for the input and suggestion, Aaron.
Hmm statCollectingFileOps is a member struct inside CouchKVStore which gets initialized during the ep engin startup phase. Since there weren't bucket deletion/creation around the time of the segfault (i.e. no CouchKVStore destructor was ever called) it is very much unlikely that the statCollectingFileOps becomes invalid by pre-mature destruction of CouchKVStore object. And, if this is memory corruption the error behavior should have been more undefined, erratic. Will still give it a try, thanks!
Comment by Andrei Baranouski [ 12/Oct/12 ]
build 2.0.0-1841-rel
test is running more than 2 hours successfully
http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/73/console
Comment by Andrei Baranouski [ 15/Oct/12 ]
on 2.0.0-1849-rel crash is not reproduced also
Comment by Farshid Ghods (Inactive) [ 15/Oct/12 ]
Hi Jin,

can we resolve this bug as fixed ?
Comment by Jin Lim [ 15/Oct/12 ]
We cannot reproduce it any longer.
Comment by kzeller [ 26/Oct/12 ]
Server now releases returned document that
is being marked deleted. This fixes a
rebalance failure where memcached crashes on
one of the nodes and rebalance exits
with the error: {mover_failed,{badmatch, known macro:
{error,closed}}}
Comment by Thuan Nguyen [ 01/Nov/12 ]
Integrated in github-ep-engine-2-0 #456 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/456/])
    MB-6806 ep engine returns key not found for a deleted doc (Revision 386c78bc4a98001e4e09540b846bda93600e7207)

     Result = SUCCESS
Jin :
Files :
* tests/ep_testsuite.cc
* src/couch-kvstore/couch-kvstore.cc
Generated at Wed Oct 22 05:47:24 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.