[MB-6592] [longevity] memcached hangs when aborting during swap rebalance operation and fails to restart ( exit 71 ) Created: 10/Sep/12 Updated: 10/Jan/13 Resolved: 26/Sep/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 2.0-beta |
| Fix Version/s: | 2.0-beta-2 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Thuan Nguyen | Assignee: | Thuan Nguyen |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | 2.0-beta-release-notes, system-test | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 6.2 64 bit build 2.0.0-1697 | ||
| Attachments: |
|
| Description |
|
Cluster information:
- 11 centos 6.2 64bit server with 4 cores CPU - Each server has 10 GB RAM and 150 GB disk. - 8 GB RAM for couchbase server at each node (80% total system memmories) - Disk format ext3 on both data and root - Each server has its own drive, no disk sharing with other server. - Load 9 million items to both buckets - Initial indexing, so cpu a little heavy load - Cluster has 2 buckets, default (3GB) and saslbucket (3GB) - Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11) * Create cluster with 10 nodes installed couchbase server 2.0.0-1697 10.3.121.13 10.3.121.14 10.3.121.15 10.3.121.16 10.3.121.17 10.3.121.20 10.3.121.22 10.3.121.24 10.3.121.25 10.3.121.23 * Data path /data * View path /data * Do swap rebalance. Add node 26 and remove node 25 * Rebalance failed and saw a lot of error message memcached exited with status 71 in log page. Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-memcached-exit-71-20120910.tgz Link to atop node 13 https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13 Due to large size of atop file, all other atop files are in /tmp directory of each node |
| Comments |
| Comment by Karan Kumar [ 10/Sep/12 ] |
|
From the memcached logs
Sat Sep 8 17:56:06.159645 3: Connection closed by mccouch Sat Sep 8 17:56:06.159864 3: Trying to connect to mccouch: "localhost:11213" Sat Sep 8 17:56:06.160432 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.160462 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.165925 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.165961 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.171397 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.171430 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.176813 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.176839 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.182241 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.182267 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.187653 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.187679 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.193033 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.193058 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.198462 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.198518 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.204013 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.204038 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.209451 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.209477 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.214788 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.214813 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.220685 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.220712 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.226387 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.226415 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.231879 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.231909 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.237313 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.237339 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.242731 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.242757 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.248138 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.248164 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.253603 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.253630 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.261017 3: Failed to connect to: "localhost:11213" |
| Comment by Karan Kumar [ 10/Sep/12 ] |
|
GDB output of the memcached process:-
https://friendpaste.com/5gFQw9wPBFOgNjue64HfM2 Not sure if related to http://www.couchbase.com/issues/browse/MB-5653 |
| Comment by Karan Kumar [ 10/Sep/12 ] |
| We have the core file of the memcached process available if needed. |
| Comment by Karan Kumar [ 10/Sep/12 ] |
|
10.3.121.13 (root/couchbase)
/root/karan |
| Comment by Farshid Ghods [ 10/Sep/12 ] |
|
Karan,
shouldn't this be assigned to couchbase bucket team first ? |
| Comment by Karan Kumar [ 10/Sep/12 ] |
| Assigning to Chiyoung to see if this is ep-engine related |
| Comment by Chiyoung Seo [ 10/Sep/12 ] |
| Mike, please take a look at this issue. |
| Comment by Trond Norbye [ 11/Sep/12 ] |
| Just guessing here, but from the callstack it looks like the crash is in the assign() method for the basic_string in flushOneDelOrSet.. (the other stacks seems somewhat sane).. |
| Comment by Aleksey Kondratenko [ 11/Sep/12 ] |
| Given we're spinning, looks similar to a case I hit some time ago when some list in flusher was 'infinite'. I even showed this to Chiyoung but we were unable to make any sense of what we saw |
| Comment by Mike Wiederhold [ 11/Sep/12 ] |
|
Trond, I think this might be in the logging thread. I found this in the backtrace with gdb. Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.'. #0 0x0000003d7160b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 There are a few ep-engine threads at this location also, but since the logging stuff is relatively new I think it might be the cause. Please take a look at this and assign it back to me if you think the problem is in ep-engine. |
| Comment by Trond Norbye [ 12/Sep/12 ] |
|
I find it relatively hard to believe that it may dump core on that line given that the code for that looks like:
At file scope we have: static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; The call we're currently stuck in looks like: pthread_mutex_lock(&mutex); while (run) { struct timeval tp; gettimeofday(&tp, NULL); [ ... cut ...] gettimeofday(&tp, NULL); next = tp.tv_sec + (unsigned int)sleeptime; struct timespec ts = { .tv_sec = next }; pthread_cond_timedwait(&cond, &mutex, &ts); <- This is where we're stuck } I can't see how we can pass stuff to pthread_cond_timedwait here that may cause it to _crash_ (it _could_ return with EINVAL for invalid input arguments)... If only I figured out how to ask gdb to show me the thread that caused the crash (and why.. which signal etc) |
| Comment by Chiyoung Seo [ 12/Sep/12 ] |
| Let me take a look at this issue to see if there are anything suspicious in ep-engine. |
| Comment by Chiyoung Seo [ 26/Sep/12 ] |
|
Tony, I was not able to reproduce this issue with 4 node cluster and still don't know why it happened. Did you see the same issue recently in your manual and longevity test? |
| Comment by Thuan Nguyen [ 26/Sep/12 ] |
| I have not seen this issue since then in my system test. |
| Comment by Chiyoung Seo [ 26/Sep/12 ] |
| Let's close this bug at this time, and create a new bug if we see this issue again. There have been lots of fixes including bucket destroy in ep-engine. |