[MB-7272] memcached/ep-engine crashes in flusher or other paths when it receives a shutdown message from ns-server

This can occur in several situations:

1- when a node is warming up and ns-server sends a shutdown command to delete the bucket during warmup
2- when a failed-over node is warming up and ns-server sends a shutdown command to delete the bucket
3- when a node has been rebalanced out but memcached is, for some reason, still doing work, and ns-server sends a shutdown command

Scenario #2 is very common. In large environments, where warmup can take 8 hours or so, the user will keep retrying the rebalance button, and it won't succeed unless the user manually kills the memcached process with the kill command.

In general, ep-engine needs to abort instead of crashing.
On the other hand, during a normal shutdown, when ns-server sends ep-engine a command to shut down, ep-engine should wait until all items are flushed and then shut down.

It seems we need to differentiate between a command that says shut down gracefully and one that says shut down with force.
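
To make the graceful/forced distinction concrete, here is a minimal sketch in Python (not ep-engine's C++). ShutdownMode, Flusher and all of its methods are hypothetical names used only for illustration; they do not correspond to actual ep-engine or ns-server code. A graceful shutdown drains everything still queued before stopping, while a forced shutdown stops immediately and drops whatever is pending.

```python
from collections import deque
from enum import Enum


class ShutdownMode(Enum):
    """Hypothetical shutdown modes, not an actual ep-engine type."""
    GRACEFUL = "graceful"  # flush everything still queued, then stop
    FORCE = "force"        # stop right away; queued items are dropped


class Flusher:
    """Toy stand-in for a component that persists queued items."""

    def __init__(self):
        self.pending = deque()  # items waiting to be persisted
        self.persisted = []     # items already written out
        self.running = True

    def enqueue(self, item):
        if not self.running:
            raise RuntimeError("flusher is shut down")
        self.pending.append(item)

    def flush_one(self):
        """Persist a single queued item; return False when nothing is left."""
        if not self.pending:
            return False
        self.persisted.append(self.pending.popleft())
        return True

    def shutdown(self, mode):
        """Stop the flusher and return the number of items dropped."""
        if mode is ShutdownMode.GRACEFUL:
            # Normal shutdown: wait until every queued item is flushed.
            while self.flush_one():
                pass
        # Forced shutdown (e.g. bucket deletion during warmup) skips the
        # drain, so anything still pending is discarded here.
        dropped = len(self.pending)
        self.pending.clear()
        self.running = False
        return dropped
```

In this sketch, a bucket deletion during warmup would map to ShutdownMode.FORCE, since the data is being discarded anyway, and a normal node shutdown would map to ShutdownMode.GRACEFUL.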

some of the bugs :


Comment by Jin Lim [ 30/Nov/12 ]
The toy build for a fix candidate has been uploaded for testing. QE and the development team will be verifying the fix for the next few days. Thanks!

Comment by Andrei Baranouski [ 03/Dec/12 ]
Tried to test the toy build for the cases in MB-7110 [system test] rebalance failed due to "Failed to wait deletion of some buckets on some nodes"
with these steps:
1. cluster 4 nodes, 1 default and 1 sasl bucket with 1500MB of RAM allocated,,,
2. load ~1.6M items in each bucket
3. add node in cluster

Result: received exactly the same errors as in
MB-7263 Service memcached constantly exited on dest master node after certain steps in XDCR + rebalance scenarious: Port server memcached on node 'ns_1@' exited with status 71. failed to listen on TCP port 11210: Address already in use

Port server memcached on node 'ns_1@' exited with status 71. Restarting. Messages: Mon Dec 3 03:11:51.120720 PST 3: failed to listen on TCP port 11210: Address already in use

The cluster has been left alive for investigation.

Comment by Jin Lim [ 03/Dec/12 ]
Thanks Andrei. Please leave the cluster while the development team is investigating the issue.

In the meantime, please note that:

1) This bug tracks the ep-engine crash when it receives the shutdown (delete) while warming up. The toy build must have addressed the issue, and your last test didn't see the crash from ep-engine threads.
2) As you stated, the latest error you encountered (OSERR = 71, port already in use) sounds much like the original issue of MB-7263, which I will continue to investigate from this point on.

Comment by Steve Yen [ 04/Dec/12 ]
From the bug-scrub meeting:

Looks like there's a fix from Jin and one from the ns-server team (the infinity fix), and they both need to go in.

Comment by Farshid Ghods (Inactive) [ 05/Dec/12 ]
Build 1974 has this fix.

Comment by kzeller [ 06/Dec/12 ]
Added to RN as:

During Couchbase Server warmup or rebalance, if you delete a data bucket,
it will cause the node to crash.

Comment by Thuan Nguyen [ 19/Dec/12 ]
