[MB-6592] [longevity] memcached hangs when aborting during swap rebalance operation and fails to restart ( exit 71 ) Created: 10/Sep/12  Updated: 10/Jan/13  Resolved: 26/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0-beta
Fix Version/s: 2.0-beta-2
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Thuan Nguyen
Resolution: Cannot Reproduce Votes: 0
Labels: 2.0-beta-release-notes, system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.2 64-bit, build 2.0.0-1697

Attachments: File memcached_logfile    

 Description   
Cluster information:
- 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 10 GB RAM and a 150 GB disk.
- 8 GB RAM allocated to Couchbase Server on each node (80% of total system memory)
- ext3 filesystem on both the data and root partitions
- Each server has its own drive; no disk is shared with another server.
- Loaded 9 million items into both buckets
- Initial indexing was in progress, so CPU load was somewhat heavy
- Cluster has 2 buckets, default (3 GB) and saslbucket (3 GB)
- Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket)

* Create a 10-node cluster with Couchbase Server 2.0.0-1697 installed:

10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23
* Data path /data
* View path /data

* Do a swap rebalance: add node 26 and remove node 25 (see the couchbase-cli sketch below).
* The rebalance failed, and the log page showed many "memcached exited with status 71" error messages.

Link to diags of all nodes: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-memcached-exit-71-20120910.tgz

Link to the atop file for node 13: https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
Due to their large size, all other atop files are in the /tmp directory of each node.
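
For reference, a swap rebalance like the one above can be kicked off with couchbase-cli roughly as follows. This is only a sketch: the REST credentials (Administrator/password) are placeholders rather than the cluster's actual ones, and the test itself may well have been driven through the UI or REST API.

    /opt/couchbase/bin/couchbase-cli rebalance -c 10.3.121.13:8091 \
        -u Administrator -p password \
        --server-add=10.3.121.26:8091 \
        --server-add-username=Administrator \
        --server-add-password=password \
        --server-remove=10.3.121.25:8091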


 Comments   
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
From the memcached logs



Sat Sep 8 17:56:06.159645 3: Connection closed by mccouch
Sat Sep 8 17:56:06.159864 3: Trying to connect to mccouch: "localhost:11213"
Sat Sep 8 17:56:06.160432 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.160462 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.165925 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.165961 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.171397 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.171430 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.176813 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.176839 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.182241 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.182267 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.187653 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.187679 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.193033 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.193058 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.198462 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.198518 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.204013 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.204038 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.209451 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.209477 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.214788 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.214813 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.220685 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.220712 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.226387 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.226415 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.231879 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.231909 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.237313 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.237339 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.242731 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.242757 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.248138 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.248164 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.253603 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.253630 3: Failed to connect to: "localhost:11213"
Sat Sep 8 17:56:06.261017 3: Failed to connect to: "localhost:11213"
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
GDB output of the memcached process:
https://friendpaste.com/5gFQw9wPBFOgNjue64HfM2


Not sure if related to
http://www.couchbase.com/issues/browse/MB-5653
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
We have the core file of the memcached process available if needed.
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
The core file is on 10.3.121.13 (root/couchbase), in /root/karan.
Comment by Farshid Ghods (Inactive) [ 10/Sep/12 ]
Karan,

Shouldn't this be assigned to the couchbase-bucket team first?
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
Assigning to Chiyoung to see if this is ep-engine-related.
Comment by Chiyoung Seo [ 10/Sep/12 ]
Mike, please take a look at this issue.
Comment by Trond Norbye [ 11/Sep/12 ]
Just guessing here, but from the call stack it looks like the crash is in the assign() method of the basic_string in flushOneDelOrSet (the other stacks seem somewhat sane).
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
Given that we're spinning, this looks similar to a case I hit some time ago when a list in the flusher was 'infinite'. I even showed it to Chiyoung, but we were unable to make sense of what we saw.
Comment by Mike Wiederhold [ 11/Sep/12 ]
Trond,

I think this might be in the logging thread. I found this in the backtrace with gdb.

Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.'.
#0 0x0000003d7160b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

There are a few ep-engine threads at this location as well, but since the logging code is relatively new, I think it might be the cause. Please take a look at this and assign it back to me if you think the problem is in ep-engine.
Comment by Trond Norbye [ 12/Sep/12 ]
I find it relatively hard to believe that it would dump core on that line, given that the code looks like this:

At file scope we have:

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

The call we're currently stuck in looks like:

    pthread_mutex_lock(&mutex);
    while (run) {
        struct timeval tp;
        gettimeofday(&tp, NULL);

        [ ... cut ...]

        gettimeofday(&tp, NULL);
        next = tp.tv_sec + (unsigned int)sleeptime;
        struct timespec ts = { .tv_sec = next };
        pthread_cond_timedwait(&cond, &mutex, &ts); <- This is where we're stuck
    }

I can't see how we could pass anything to pthread_cond_timedwait here that would cause it to _crash_ (it _could_ return EINVAL for invalid input arguments)...
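
To illustrate that point, here is a minimal, self-contained sketch of the same timed-wait pattern (this is not the memcached source; logger_loop, run, and sleeptime are just illustrative names). Bad arguments surface as a return code, not as a crash:

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static volatile int run = 1;
    static int sleeptime = 1;   /* seconds between wakeups */

    static void *logger_loop(void *arg) {
        (void)arg;
        pthread_mutex_lock(&mutex);
        while (run) {
            /* ... periodic work would go here ... */
            struct timeval tp;
            gettimeofday(&tp, NULL);
            struct timespec ts = { .tv_sec = tp.tv_sec + sleeptime };
            int rv = pthread_cond_timedwait(&cond, &mutex, &ts);
            if (rv != 0 && rv != ETIMEDOUT) {
                /* Invalid arguments are reported via the return value
                 * (e.g. EINVAL); the call itself does not crash. */
                fprintf(stderr, "pthread_cond_timedwait: %s\n", strerror(rv));
            }
        }
        pthread_mutex_unlock(&mutex);
        return NULL;
    }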

If only I could figure out how to ask gdb to show me the thread that caused the crash (and why: which signal, etc.).
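
A rough way to get that out of gdb when opening the core would be something like the following (the core file name below is a guess):

    # While loading the core, gdb prints a line like
    #   "Program terminated with signal SIGSEGV, Segmentation fault."
    # naming the fatal signal, and it typically selects the thread that
    # received the signal as the current thread.
    gdb /opt/couchbase/bin/memcached /root/karan/core.memcached

    (gdb) info threads          # list all threads in the core
    (gdb) bt full               # full backtrace of the current (crashing) thread
    (gdb) thread apply all bt   # backtraces for every thread, for comparison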


Comment by Chiyoung Seo [ 12/Sep/12 ]
Let me take a look at this issue to see if there is anything suspicious in ep-engine.
Comment by Chiyoung Seo [ 26/Sep/12 ]
Tony,

I was not able to reproduce this issue with a 4-node cluster and still don't know why it happened.

Did you see the same issue recently in your manual and longevity test?
Comment by Thuan Nguyen [ 26/Sep/12 ]
I have not seen this issue since then in my system test.
Comment by Chiyoung Seo [ 26/Sep/12 ]
Let's close this bug for now and create a new one if we see the issue again. There have been lots of fixes in ep-engine, including fixes to bucket destroy.