[MB-7272] memcached/ep-engine crashes in flusher or other paths when it receives a shutdown message from ns-server
Created: 27/Nov/12  Updated: 19/Dec/12  Resolved: 05/Dec/12

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 1.8.1, 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Farshid Ghods (Inactive) Assignee: Jin Lim
Resolution: Fixed Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Sub-Tasks:
Key | Summary | Type | Status | Assignee
MB-7110 | [system test] rebalance failed due to... | Technical task | Resolved | Jin Lim
MB-7263 | Service memcached constantly exited o... | Technical task | Resolved | Aleksey Kondratenko

 Description   
This crash can occur in several situations:

1- when a node is warming up and ns-server sends a shutdown command to delete the bucket during warmup
2- when a failed-over node is warming up and ns-server sends a shutdown command to delete the bucket
3- when a node has been rebalanced out but memcached is, for some reason, still doing something, and ns-server sends a shutdown command

Scenario #2 is very common. In large environments where warmup takes 8 hours or so, the user will keep retrying the rebalance button, and it won't succeed unless the user manually kills the memcached process with the kill command.

In general, ep-engine needs to abort instead of crashing.
On the other hand, during a normal shutdown, when ns-server sends ep-engine a command to shut down, ep-engine should wait until all items are flushed and then shut down.

It seems we need to differentiate between a command that says shut down gracefully and one that says shut down with force.
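
One way to picture the graceful/forced distinction is the toy sketch below. Every name in it (ShutdownMode, ToyFlusher, enqueue, shutdown) is hypothetical and chosen for illustration only; it is not the ep-engine or memcached API, and it ignores threading entirely. A graceful shutdown drains the dirty queue before stopping, while a forced shutdown stops at once and reports how many items were dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical names throughout; this is NOT the ep-engine API.
enum class ShutdownMode { Graceful, Force };

class ToyFlusher {
public:
    // Queue an item that still has to be written to disk.
    void enqueue(const std::string& item) { dirty_.push_back(item); }

    // Stops the flusher. Returns how many dirty items were dropped
    // without being persisted (always 0 for a graceful shutdown).
    std::size_t shutdown(ShutdownMode mode) {
        assert(!stopped_ && "shutdown requested twice");
        if (mode == ShutdownMode::Graceful) {
            // Graceful: persist everything that is still queued.
            while (!dirty_.empty()) {
                persisted_.push_back(dirty_.front());
                dirty_.pop_front();
            }
        }
        // Force: whatever is still queued is abandoned, but cleanly.
        const std::size_t dropped = dirty_.size();
        dirty_.clear();
        stopped_ = true;
        return dropped;
    }

    std::size_t persistedCount() const { return persisted_.size(); }
    bool stopped() const { return stopped_; }

private:
    std::deque<std::string> dirty_;       // items waiting to be flushed
    std::vector<std::string> persisted_;  // stand-in for the on-disk store
    bool stopped_ = false;
};
```

In this sketch a bucket deletion would map to ShutdownMode::Force, since the data is about to be discarded anyway, and a normal node shutdown would map to ShutdownMode::Graceful.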

Some of the related bugs:

http://www.couchbase.com/issues/browse/MB-7110
http://www.couchbase.com/issues/browse/MB-7263

 Comments   
Comment by Jin Lim [ 30/Nov/12 ]
The toy build for a fix candidate has been uploaded for testing. QE and the development team will be verifying the fix for the next few days. Thanks!

http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_toy-couchstore-x86_64_2.0.0-11302012A-toy.rpm
Comment by Andrei Baranouski [ 03/Dec/12 ]
I tried to test the toy build against the cases in MB-7110 ([system test] rebalance failed due to "Failed to wait deletion of some buckets on some nodes")
with these steps:
1. 4-node cluster, 1 default and 1 sasl bucket with 1500MB of RAM allocated
10.3.121.112, 10.3.121.113, 10.3.121.114, 10.3.121.115
2. load ~1.6M items into each bucket
3. add node 10.3.121.116 to the cluster

Result: received exactly the same errors as in
MB-7263 Service memcached constantly exited on dest master node after certain steps in XDCR + rebalance scenarious: Port server memcached on node 'ns_1@10.3.121.63' exited with status 71. failed to listen on TCP port 11210: Address already in use

Port server memcached on node 'ns_1@10.3.121.112' exited with status 71. Restarting. Messages: Mon Dec 3 03:11:51.120720 PST 3: failed to listen on TCP port 11210: Address already in use

Leaving the cluster alive for investigation.
 

Comment by Jin Lim [ 03/Dec/12 ]
Thanks Andrei. Please leave the cluster as is while the development team is investigating the issue.

In the meantime, please note that:

1) This bug tracks the ep-engine crash that occurs when it receives the shutdown (delete) while warming up. The toy build must have addressed the issue, and your last test didn't see the crash from ep-engine threads.
2) As you stated, the latest error you encountered (OSERR = 71, port already in use) sounds much like the original issue of MB-7263, which I will continue to investigate from this point on.

Thanks,
Jin
  
Comment by Steve Yen [ 04/Dec/12 ]
From the bug-scrub meeting:

It looks like there's a fix from Jin and one from the ns-server team (the infinity fix), and they both need to go in.
Comment by Farshid Ghods (Inactive) [ 05/Dec/12 ]
build 1974 has this fix
Comment by kzeller [ 06/Dec/12 ]
Added to RN as:

During Couchbase Server warmup or rebalance, if you delete a data bucket,
it will cause the node to crash.
Comment by Thuan Nguyen [ 19/Dec/12 ]
Integrated in github-ep-engine-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/461/])
    MB-7272 stop warmup task immediately if shutdown is being requested (Revision 6b89027ba3b2b461d978d593b14918040c819e2c)

     Result = SUCCESS
Jin :
Files :
* src/warmup.cc
* src/warmup.hh
* src/ep.cc
* src/ep.hh
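
The commit message above ("stop warmup task immediately if shutdown is being requested") describes a cooperative-cancellation pattern. The sketch below is only a guess at the general shape of such a check, not the code from src/warmup.cc: ToyWarmup, WarmupResult, shutdownRequested and run are all hypothetical names, and the "loading" is reduced to counting keys. The point is that the warmup loop re-checks a shared flag between batches, so a shutdown request takes effect after at most one batch instead of after the whole warmup.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical sketch; NOT the actual change made in src/warmup.cc.
enum class WarmupResult { Completed, Cancelled };

struct ToyWarmup {
    // Set by whichever thread handles the shutdown (bucket delete) request.
    std::atomic<bool> shutdownRequested{false};
    // Number of keys loaded so far.
    std::size_t loaded = 0;

    // Loads keys one batch at a time and re-checks the flag before each batch.
    WarmupResult run(const std::vector<int>& keys, std::size_t batchSize) {
        if (batchSize == 0) batchSize = 1;  // avoid an infinite loop
        for (std::size_t i = 0; i < keys.size(); i += batchSize) {
            if (shutdownRequested.load(std::memory_order_acquire)) {
                return WarmupResult::Cancelled;  // stop without crashing
            }
            const std::size_t end = std::min(i + batchSize, keys.size());
            loaded += end - i;  // stand-in for reading the batch from disk
        }
        return WarmupResult::Completed;
    }
};
```

A smaller batch size makes cancellation more responsive at the cost of checking the flag more often; with an 8-hour warmup, as in the description, even a coarse batch keeps the delay short compared with waiting for warmup to finish.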