[MB-7272] memcached/ep-engine crashes in flusher or other paths when it receives a shutdown message from ns-server Created: 27/Nov/12 Updated: 19/Dec/12 Resolved: 05/Dec/12
|Affects Version/s:||1.8.1, 2.0|
|Reporter:||Farshid Ghods||Assignee:||Jin Lim|
|Σ Remaining Estimate:||Not Specified||Remaining Estimate:||Not Specified|
|Σ Time Spent:||Not Specified||Time Spent:||Not Specified|
|Σ Original Estimate:||Not Specified||Original Estimate:||Not Specified|
This case can occur in many places:
1- when a node is warming up and ns-server sends a shutdown command to delete the bucket during warmup
2- when a failed-over node is warming up and ns-server sends a shutdown command to delete the bucket
3- when a node has been rebalanced out but memcached is for some reason still doing something, and ns-server sends a shutdown command
Scenario #2 is very common. In large environments where warmup takes 8 hours or so, the user will keep retrying the rebalance button, and it won't succeed unless the user manually kills the memcached process with the kill command.
In general, ep-engine needs to abort instead of crashing.
On the other hand, during a normal shutdown, when ns-server sends ep-engine a command to shut down, ep-engine should wait until all items are flushed and then shut down.
It seems we need to differentiate between a command that says shut down gracefully and one that says shut down with force.
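To make the graceful vs. forced distinction concrete, here is a minimal sketch. It is not ep-engine code: `ToyFlusher`, its fields, and the `force` flag are hypothetical names used only for illustration, and the "disk" is just an in-memory list. It only shows the intended behaviour: a graceful shutdown drains the queue first, a forced one drops pending items and stops right away.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyFlusher:
    """Hypothetical stand-in for a flusher; not ep-engine's actual class."""
    pending: deque = field(default_factory=deque)  # items not yet persisted
    persisted: list = field(default_factory=list)  # items written to "disk"
    running: bool = True

    def enqueue(self, item):
        # Refuse new work once shutdown has completed.
        if not self.running:
            raise RuntimeError("flusher is shut down")
        self.pending.append(item)

    def flush_one(self):
        # Persist the oldest pending item.
        self.persisted.append(self.pending.popleft())

    def shutdown(self, force=False):
        """Stop the flusher and return the number of items discarded.

        force=False: drain every pending item first (normal shutdown).
        force=True:  drop pending items and stop at once (e.g. bucket delete).
        """
        dropped = 0
        if force:
            dropped = len(self.pending)
            self.pending.clear()
        else:
            while self.pending:
                self.flush_one()
        self.running = False
        return dropped
```

With this shape, a bucket delete during warmup would call `shutdown(force=True)` and return promptly, while a normal node shutdown would call `shutdown()` and only return once everything queued has been persisted.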
some of the bugs :
|Comment by Jin Lim [ 30/Nov/12 ]|
The toy build for a fix candidate has been uploaded for testing. QE and the development team will be verifying the fix for the next few days. Thanks!
|Comment by Andrei Baranouski [ 03/Dec/12 ]|
Tried to test the toy build for the cases in |
1. 4-node cluster, 1 default and 1 sasl bucket with 1500MB of RAM allocated
10.3.121.112, 10.3.121.113, 10.3.121.114, 10.3.121.115
2. load ~1.6M items into each bucket
3. add node 10.3.121.116 to the cluster
Result: received exactly the same errors as in the
Port server memcached on node 'email@example.com' exited with status 71. Restarting. Messages: Mon Dec 3 03:11:51.120720 PST 3: failed to listen on TCP port 11210: Address already in use
Leaving the cluster alive for investigation.
|Comment by Jin Lim [ 03/Dec/12 ]|
Thanks Andrei. Please leave the cluster up while the development team is investigating the issue.
In the meantime, please note that:
1) this bug is to track the ep-engine crash when it receives the shutdown (delete) while warming up. The toy build must have addressed that issue, and your last test didn't see a crash from ep-engine threads.
2) as you stated, the latest error (OSERR = 71, port already being in use) you encountered sounds much like the original issue of
|Comment by Steve Yen [ 04/Dec/12 ]|
From the bug-scrub meeting:
it looks like there's a fix from Jin and one from the ns-server team (the infinity fix), and they both need to go in.
|Comment by Farshid Ghods [ 05/Dec/12 ]|
|build 1974 has this fix|
|Comment by Karen Zeller [ 06/Dec/12 ]|
Added to RN as:
During Couchbase Server warmup or rebalance, if you delete a data bucket,
it will cause the node to crash.
|Comment by Thuan Nguyen [ 19/Dec/12 ]|
Integrated in github-ep-engine-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/461/])|
Result = SUCCESS