[MB-7272] memcached/ep-engine crashes in flusher or other paths when it receives a shutdown message from ns-server Created: 27/Nov/12 Updated: 19/Dec/12 Resolved: 05/Dec/12 |
|
| Status: | Resolved |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 1.8.1, 2.0 |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Farshid Ghods | Assignee: | Jin Lim |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
| Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
| Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
| Sub-Tasks: |
|
| Description |
|
This crash can occur in several situations:
1- when a node is warming up and ns-server sends a shutdown command to delete the bucket during warmup
2- when a failed-over node is warming up and ns-server sends a shutdown command to delete the bucket
3- when a node has been rebalanced out but memcached is still doing something for some reason, and ns-server sends a shutdown command

Scenario #2 is very common. In large environments where warmup takes 8 hours or so, the user will keep retrying the rebalance button, and it won't succeed unless the user manually kills the memcached process with the kill command.

In general, ep-engine needs to abort instead of crashing. On the other hand, during a normal shutdown, when ns-server sends ep-engine a command to shut down, ep-engine should wait until all items are flushed and then shut down. It seems we need to differentiate between a command that says shut down gracefully and one that says shut down with force.

Some of the related bugs:
http://www.couchbase.com/issues/browse/MB-7110
http://www.couchbase.com/issues/browse/MB-7263 |
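To make the graceful versus forced distinction concrete, here is a minimal sketch of what the two modes could mean for a flusher. Every name in it (ShutdownMode, ToyFlusher, flushSome, and so on) is hypothetical and chosen for illustration only; it is not ep-engine's actual API and not the fix that was committed. It assumes a single-threaded toy queue, so it shows the intended semantics rather than the threading involved in a real flusher.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// All names here are hypothetical; this is not ep-engine's actual API.
enum class ShutdownMode { Graceful, Forced };

class ToyFlusher {
public:
    // Queue a dirty item for persistence. Returns false once the flusher has stopped.
    bool enqueue(const std::string& item) {
        if (stopped_) return false;
        dirty_.push_back(item);
        return true;
    }

    // Persist up to `batch` queued items; returns how many were written.
    std::size_t flushSome(std::size_t batch) {
        std::size_t written = 0;
        while (written < batch && !dirty_.empty()) {
            persisted_.push_back(dirty_.front());
            dirty_.pop_front();
            ++written;
        }
        return written;
    }

    // Graceful: drain every queued item, then stop.
    // Forced: discard queued items and stop at once (e.g. bucket deletion).
    // Returns the number of items discarded. Safe to call repeatedly.
    std::size_t shutdown(ShutdownMode mode) {
        if (stopped_) return 0;
        std::size_t discarded = 0;
        if (mode == ShutdownMode::Graceful) {
            while (flushSome(64) > 0) {}
            assert(dirty_.empty());  // nothing may be left behind
        } else {
            discarded = dirty_.size();
            dirty_.clear();
        }
        stopped_ = true;
        return discarded;
    }

    bool stopped() const { return stopped_; }
    std::size_t persistedCount() const { return persisted_.size(); }
    std::size_t pendingCount() const { return dirty_.size(); }

private:
    std::deque<std::string> dirty_;       // items waiting to be written
    std::vector<std::string> persisted_;  // stand-in for the on-disk store
    bool stopped_ = false;
};
```

In this sketch a bucket delete during warmup would map to ShutdownMode::Forced, while a normal node shutdown would map to ShutdownMode::Graceful. Making shutdown() a no-op on repeated calls is a deliberate choice here, since the description notes that users retry the operation.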
| Comments |
| Comment by Jin Lim [ 30/Nov/12 ] |
|
The toy build for a fix candidate has been uploaded for testing. QE and the development team will be verifying the fix for the next few days. Thanks!
http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_toy-couchstore-x86_64_2.0.0-11302012A-toy.rpm |
| Comment by Andrei Baranouski [ 03/Dec/12 ] |
|
Tried to test the toy build for cases in with steps:
1. cluster of 4 nodes, 1 default and 1 sasl bucket with 1500MB of RAM allocated: 10.3.121.112, 10.3.121.113, 10.3.121.114, 10.3.121.115
2. load ~1.6M items in each bucket
3. add node 10.3.121.116 to the cluster

Result: received exactly the same errors as in the Port server memcached on node 'ns_1@10.3.121.112' exited with status 71. Restarting. Messages: Mon Dec 3 03:11:51.120720 PST 3: failed to listen on TCP port 11210: Address already in use

Leaving the cluster alive for investigation. |
| Comment by Jin Lim [ 03/Dec/12 ] |
|
Thanks Andrei. Please leave the cluster while the development team is investigating the issue.
In the meantime, please note that:
1) This bug tracks the ep-engine crash when it receives the shutdown (delete) while warming up. The toy build must have addressed that issue, and your last test didn't see the crash from ep-engine threads.
2) As you stated, the latest error (OSERR = 71, port already in use) you encountered sounds much like the original issue of
Thanks, Jin |
| Comment by Steve Yen [ 04/Dec/12 ] |
|
From the bug-scrub meeting:
It looks like there's a fix from Jin and one from the ns-server team (the infinity fix), and they both need to go in. |
| Comment by Farshid Ghods [ 05/Dec/12 ] |
| build 1974 has this fix |
| Comment by Karen Zeller [ 06/Dec/12 ] |
|
Added to RN as:
During Couchbase Server warmup or rebalance, if you delete a data bucket, it will cause the node to crash. |
| Comment by Thuan Nguyen [ 19/Dec/12 ] |
|
Integrated in github-ep-engine-2-0 #461 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/461/])
Result = SUCCESS
Jin :
Files :
* src/warmup.cc
* src/warmup.hh
* src/ep.cc
* src/ep.hh |