libcouchbase client crashes when stopping server

We are experiencing crashes of the libcouchbase C client on a two-node cluster when one of the Couchbase servers is stopped while the application is accessing the bucket(s). Our application uses get-locked operations, which the client does not handle when processing network or timeout errors. See also issue CCBC-274.

In CCBC-274 I've also provided a patch to handle get-lock (and unlock, which was also missing); this improves the situation a lot.
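For context, this is roughly the shape of those operations with the libcouchbase 2.x API; a minimal sketch with placeholder key, lock timeout, and cookie (the helper names are mine, not part of the library):

    #include <string.h>
    #include <libcouchbase/couchbase.h>

    /* Schedule a get-locked (GETL) request; the CAS needed for the
     * later unlock is delivered to the registered get callback. */
    static lcb_error_t schedule_get_locked(lcb_t instance, const void *cookie)
    {
        lcb_get_cmd_t cmd;
        const lcb_get_cmd_t *cmds[] = { &cmd };

        memset(&cmd, 0, sizeof(cmd));
        cmd.v.v0.key = "key";
        cmd.v.v0.nkey = 3;
        cmd.v.v0.lock = 1;    /* turn the get into a locked get */
        cmd.v.v0.exptime = 5; /* lock timeout in seconds */
        return lcb_get(instance, cookie, 1, cmds);
    }

    /* Release the lock using the CAS returned by the locked get. */
    static lcb_error_t schedule_unlock(lcb_t instance, const void *cookie,
                                       lcb_cas_t cas)
    {
        lcb_unlock_cmd_t cmd;
        const lcb_unlock_cmd_t *cmds[] = { &cmd };

        memset(&cmd, 0, sizeof(cmd));
        cmd.v.v0.key = "key";
        cmd.v.v0.nkey = 3;
        cmd.v.v0.cas = cas;
        return lcb_unlock(instance, cookie, 1, cmds);
    }

It is these two operation types that the error-handling paths did not account for before the patch.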

However, yesterday we had another crash when the Couchbase server was stopped on one of the nodes:

Program terminated with signal 6, Aborted.
(gdb) where
#0  0x00007f7407bbf8e5 in raise () from /lib64/libc.so.6
#1  0x00007f7407bc10c5 in abort () from /lib64/libc.so.6
#2  0x00007f740997269a in ringbuffer_consumed (buffer=<value optimized out>, nb=41) at src/ringbuffer.c:263
#3  0x00007f7409973c5c in lcb_purge_single_server (server=0x7f73b0002208, error=LCB_ETIMEDOUT) at src/server.c:338
#4  0x00007f74099758d1 in lcb_server_timeout_handler (sock=<value optimized out>, which=<value optimized out>, arg=<value optimized out>) at src/timeout.c:38
#5  0x00007f7409b8cb44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#6  0x000000000046e476 in con_thr (arg=0x1471218) at src/store_couchbase.c:2659
#7  0x00007f7407f27851 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f7407c7594d in clone () from /lib64/libc.so.6
(gdb) up
#1  0x00007f7407bc10c5 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007f740997269a in ringbuffer_consumed (buffer=<value optimized out>, nb=41) at src/ringbuffer.c:263
263	src/ringbuffer.c: No such file or directory.
	in src/ringbuffer.c
(gdb) p n
$2 = 0
(gdb) up
#3  0x00007f7409973c5c in lcb_purge_single_server (server=0x7f73b0002208, error=LCB_ETIMEDOUT) at src/server.c:338
338	src/server.c: No such file or directory.
	in src/server.c
(gdb) p req
$1 = {request = {magic = 128 '\200', opcode = 148 '\224', keylen = 3328, extlen = 4 '\004', datatype = 0 '\000', vbucket = 52225, bodylen = 285212672, opaque = 1347961, cas = 0}, 
  bytes = "\200\224\000\r\004\000\001\314\000\000\000\021y\221\024\000\000\000\000\000\000\000\000"}
336        ringbuffer_consumed(stream, packetsize);
337        if (mirror) {
338            ringbuffer_consumed(mirror, packetsize);
339        }

So the crash is caused by the following abort in ringbuffer_consumed(). For what it's worth, decoding the dumped request header (the fields are in network byte order) suggests a GET_LOCKED packet: opcode 0x94, keylen = ntohs(0x0d00) = 13, extlen = 4, bodylen = ntohl(0x11000000) = 17, so packetsize = 24 + 17 = 41, which matches nb:

    lcb_size_t n = ringbuffer_read(buffer, NULL, nb);
    if (n != nb) {
        abort();
    }

(n = 0, nb = 41)
Apparently it is possible for ringbuffer_read() to return fewer bytes than requested, and that case should not be answered with an abort().
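For illustration, a defensive variant would report the short read to the caller instead of aborting. This is only a sketch of the idea (hypothetical name and error convention, not the actual upstream fix):

    /* Hypothetical variant of ringbuffer_consumed(): returns -1 on a
     * short read instead of calling abort(), leaving recovery (e.g.
     * failing the purge) to the caller. */
    static int ringbuffer_consumed_checked(ringbuffer_t *buffer, lcb_size_t nb)
    {
        lcb_size_t n = ringbuffer_read(buffer, NULL, nb);
        if (n != nb) {
            return -1;
        }
        return 0;
    }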

Is there any ongoing effort to improve the stability of the libcouchbase C client, especially in failover situations?

CentOS 6.4, couchbase-server-2.1.1, libcouchbase2-2.0.6-2 (2.0.6-1 with patch CCBC-274)
(libcouchbase 2.1.3 behaves exactly the same)

2 Answers


Interesting; I've been having the same crash for a little while. I first observed it on 2.0.6. When I first posted about it I was recommended to move to 2.0.7, which did fix a different crash for me, but the assert you've seen came back.

Here is my 2.0.7 backtrace:

#0 0x00799410 in __kernel_vsyscall ()
#1 0x00875df0 in raise () from /lib/libc.so.6
#2 0x00877701 in abort () from /lib/libc.so.6
#3 0x082a8476 in ringbuffer_consumed (buffer=0x9940d4ac, nb=82) at src/ringbuffer.c:271
#4 0x082a8796 in lcb_purge_single_server (server=0x9940d448, error=LCB_ETIMEDOUT) at src/server.c:331
#5 0x082aaef7 in lcb_server_timeout_handler (sock=-1, which=1, arg=0x9940d448) at src/timeout.c:38
#6 0x082b5ce8 in event_process_active (base=0x6f161000) at event.c:385
#7 0x082b5f3d in event_base_loop (base=0x6f161000, flags=0) at event.c:525
#8 0x082b55d1 in lcb_io_run_event_loop (iops=0x0) at plugins/io/libevent/plugin-libevent.c:320
#9 0x082ab860 in lcb_wait (instance=0x96d100c8) at src/wait.c:67
#10 0x082a9f83 in lcb_synchandler_return (instance=0x96d100c8, retcode=LCB_SUCCESS) at src/synchandler.c:263
#11 0x082a29f2 in lcb_arithmetic (instance=0x96d100c8, command_cookie=0x12df8690, num=1, items=0xa8224f28) at src/arithmetic.c:40

I'd been having trouble working through the cause, so eventually I wrote a function to walk the buffers in the same way as lcb_purge_single_server() but without making any modifications; it is pasted below.

I called this from some places I deemed interesting; after a few iterations this brought me as close as I could get with this method.

I had the following at the bottom of lcb_arithmetic():

    TRACE_ARITHMETIC_BEGIN(&req, key, nkey, delta, initial,
                           create ? exp : (lcb_time_t)0xffffffff);
    lcb_check_single_server(server);
    lcb_server_start_packet(server, command_cookie, req.bytes,
                            sizeof(req.bytes));
    lcb_server_write_packet(server, key, nkey);
    lcb_server_end_packet(server);
    lcb_server_send_packets(server);
    lcb_check_single_server(server);

I could see the assert() triggering in the second call to lcb_check_single_server().

I added some more code to grab the state of the buffers before each of the following functions (a sketch of the dump helper appears after the list):

lcb_server_start_packet()
lcb_server_write_packet()
lcb_server_end_packet()
lcb_server_send_packets()
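The dump helper was along these lines (a sketch; the field names are those of libcouchbase's ringbuffer_t from src/ringbuffer.h):

    #include <stdio.h>

    /* Print one ringbuffer_t in the same shape as the dumps below. */
    static void dump_ringbuffer(const char *name, const ringbuffer_t *rb)
    {
        fprintf(stderr,
                "%s = {root = %p, read_head = %p, write_head = %p, "
                "size = %lu, nbytes = %lu}\n",
                name, (void *)rb->root, (void *)rb->read_head,
                (void *)rb->write_head,
                (unsigned long)rb->size, (unsigned long)rb->nbytes);
    }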

The state of the buffers at the different lines was:

before lcb_server_start_packet()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b8617e, size = 256, nbytes = 82},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, nbytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca1fd8, size = 128, nbytes = 0},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd750, size = 128, nbytes = 0},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, nbytes = 0},
connected = 0

before lcb_server_write_packet()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b861aa, size = 256, nbytes = 126},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, nbytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca2004, size = 128, nbytes = 44},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd764, size = 128, nbytes = 20},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, nbytes = 0},
connected = 0

before lcb_server_end_packet()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b861ca, size = 256, nbytes = 158},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, nbytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca2024, size = 128, nbytes = 76},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd764, size = 128, nbytes = 20},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, nbytes = 0},
connected = 0

before lcb_server_send_packets()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b861ca, size = 256, nbytes = 158},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, nbytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca2024, size = 128, nbytes = 76},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd764, size = 128, nbytes = 20},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, nbytes = 0},
connected = 0

I think it's the initial state of the buffers that looks unexpected: all of this happens with server->connected = 0, yet cmd_log already holds 82 bytes before the first packet is written, while pending starts empty. Since the purge path walks cmd_log and the pending mirror in lockstep when the server is not connected, bytes in cmd_log with no counterpart in pending would make the mirror read come up short and hit the abort. That's pretty much as far as I've gotten with this today, and I'm still thinking about where to go from here. That's work for next week; in the meantime I'm very interested in any further thoughts or suggestions on this.
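If that desynchronization is indeed the culprit, a cheap invariant check could flag it before the purge path aborts. A hypothetical helper (not part of libcouchbase; assumes the internal headers and <stdio.h>):

    /* When the server is not connected, lcb_purge_single_server()
     * consumes each cmd_log packet from the pending mirror as well,
     * so pending must hold at least as many bytes as cmd_log or the
     * mirror read will come up short. */
    static void check_purge_invariant(const lcb_server_t *server)
    {
        if (!server->connected &&
            server->cmd_log.nbytes > server->pending.nbytes) {
            fprintf(stderr,
                    "purge invariant violated: cmd_log=%lu pending=%lu\n",
                    (unsigned long)server->cmd_log.nbytes,
                    (unsigned long)server->pending.nbytes);
        }
    }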

Here is my checker function:

void lcb_check_single_server(lcb_server_t *server)
{
    protocol_binary_request_header req;
    struct lcb_command_data_st ct;
    lcb_size_t nr;
    char *packet;
    lcb_size_t packetsize;
    lcb_t root = server->instance;
    ringbuffer_t rest;
    ringbuffer_t stream_copy = server->cmd_log;
    ringbuffer_t *stream = &stream_copy;
    ringbuffer_t cookies_copy;
    ringbuffer_t *cookies;
    ringbuffer_t *mirror = NULL; /* mirror buffer should be purged with main stream */
    ringbuffer_t mirror_copy;
    ringbuffer_t output_copy;
    lcb_size_t send_size = ringbuffer_get_nbytes(&server->output);
    lcb_size_t stream_size = ringbuffer_get_nbytes(stream);
    hrtime_t now = gethrtime();

    /* Work on stack copies of the ringbuffer structs so that peeking
       and consuming only advances the copies' read heads; the live
       buffers are never modified. */
    if (server->connected) {
        cookies_copy = server->output_cookies;
        cookies = &cookies_copy;
    } else {
        cookies_copy = server->pending_cookies;
        cookies = &cookies_copy;
        mirror_copy = server->pending;
        mirror = &mirror_copy;
    }

    output_copy = server->output;
    /* Note: these asserts have side effects, so this helper must be
       compiled without NDEBUG. */
    assert(ringbuffer_initialize(&rest, 1024));

    do {
        /* One cookie and one packet header per queued command. */
        nr = ringbuffer_peek(cookies, &ct, sizeof(ct));
        if (nr != sizeof(ct)) {
            break;
        }
        nr = ringbuffer_peek(stream, req.bytes, sizeof(req));
        if (nr != sizeof(req)) {
            break;
        }
        packetsize = (lcb_uint32_t)sizeof(req) + ntohl(req.request.bodylen);
        if (stream->nbytes < packetsize) {
            break;
        }

        ringbuffer_consumed(cookies, sizeof(ct));
        assert(nr == sizeof(req));
        packet = stream->read_head;

        if (server->connected && stream_size > send_size && (stream_size - packetsize) < send_size) {
            /* Copy the rest of the current packet into the
               temporary stream */
            lcb_size_t nbytes = packetsize - (stream_size - send_size);
            assert(ringbuffer_memcpy(&rest, &output_copy, nbytes) == 0);
            ringbuffer_consumed(&output_copy, nbytes);
            send_size -= nbytes;
        }
        stream_size -= packetsize;

        /* Same consume calls as lcb_purge_single_server(); the abort
           fires here when the mirror holds fewer bytes than packetsize. */
        ringbuffer_consumed(stream, packetsize);
        if (mirror) {
            ringbuffer_consumed(mirror, packetsize);
        }
    } while (1); /* CONSTCOND */

    ringbuffer_destruct(&rest);
}


The issue has been fixed in this patch: http://review.couchbase.org/29344