[MB-5406] ep-engine drops connections too fast when rebalancing out a node Created: 30/May/12  Updated: 22/Apr/13  Resolved: 05/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 1.8.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Matt Ingenthron Assignee: Andrei Baranouski
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File MB-5406-droppedatend.log     GZip Archive MB-5406-testprogram.tar.gz    
Release Note

When investigating an issue in the Java client library with retrying operations based on not-my-vbucket responses, I noticed that at the end of a rebalance removing a server, the server being removed will drop the connection while operations are in flight.

There would be a period of time when the bucket transitions from active to dead, after the takeover, when it would only respond with not-my-vbucket replies.

Unfortunately, the current behavior makes application code, at best, need to handle more complex failure logic. At worst, unhandled by the application it could lead to data loss.

The challenge here is determining that period of time. Some clients do not disconnect, and there is no polite hangup from the server.

The attached log demonstrates the issue, and the attached test program will let one observe it. This test was carried out by:
1) Set up 3 node cluster with a default bucket which is of the Couchbase type
2) Start the test program, first argument is number of seconds to run, arguments after that are hostname/ips for the nodes in the cluster
3) Remove a node from the cluster

Expected behavior: All operations sent to the server receive a not-my-vbucket reply and are rescheduled as we receive config updates from the server.

Observed behavior: At the end of the remove server/rebalance cycle, the connection is dropped and in-flight operations will be canceled by the client, since it doesn't really know the status of those operations.

Comment by Matt Ingenthron [ 30/May/12 ]
Log showing the connection dropped, followed by canceled operations (the client cancels operations in flight), followed by timed-out operations (the client will reconnect, because these operations had already been queued for this server, which was the valid place for the data at the time the operation was requested by the application). Note this all occurs over about a 6-second interval.
Comment by Aleksey Kondratenko [ 30/May/12 ]
Hm. I didn't think about this. Thanks for pointing this out Matt.

We've just merged code that shuts down memcached more quickly at failover and on rebalance out. I didn't think it would cause a problem like that. On failover I think it's ok to lose a few requests, but rebalance out should indeed be graceful. I can stop requesting that quick memcached shutdown on rebalance out, and instead kill memcached when the node is joined back.
Comment by Farshid Ghods (Inactive) [ 30/May/12 ]
moving this to 2.0 because this is existing behavior from 1.7.x
Comment by Matt Ingenthron [ 30/May/12 ]
When you say you've just merged code, is that in the 1.8.x development branches, or in 2.0.x?

A more correct fix would probably be to add some kind of hangup to the protocol, but simply waiting until N seconds have passed without receiving any data is good enough as well. I'm guessing from your statement, though, that this faster shutdown is in ns_server, which means we don't have visibility into data received. Well, we have stats, I guess.

I know it's a longshot, but I would appeal to fix this in the upcoming 1.8.x release if at all possible.
Comment by Aleksey Kondratenko [ 30/May/12 ]
Not quite.

1.8.0 will delete all buckets and shut down memcached on rebalance out. In the process, hopefully, all connections would be gracefully bounced.

1.8.1 now just requests _exit from memcached. So that's fixable, assuming of course that the memcached, ep-engine, and bucket-engine "trinity" does that graceful connection bouncing, and assuming that this bouncing is even possible. Thinking about it a bit more, I'm not sure about that anymore.

I want to cc Trond but that thing doesn't allow me.
Comment by Aleksey Kondratenko [ 30/May/12 ]
CC-ing Trond
Comment by Aleksey Kondratenko [ 30/May/12 ]
un-cc-ing Trond
Comment by Aleksey Kondratenko [ 30/May/12 ]
Matt, this code is 1.8.1 and merged up to master.

A grace period of N seconds may work. But there's always the possibility of a client being really out of date with respect to the vbucket map. More than N seconds out of date.

Perhaps if all our mutations were idempotent, then clients could just resend them freely. But I'm not at all sure they are, especially given that AFAIK 1.8.x doesn't send CAS over TAP connections.
Comment by Matt Ingenthron [ 30/May/12 ]
I've added Trond as a watcher.

I believe the CAS identifier is sent over the TAP connection, but in any event code that's written to do CAS operations is expecting to receive OK or CAS_MISMATCH. The client library interposes for NOT_MY_VBUCKET and uses that as a trigger to reschedule the operation.

Clients could be taught to automatically retry idempotent operations that are sent out and the connection is dropped, but a more general solution would be just to make sure we send NOT_MY_VBUCKET for a time. This will allow us to handle other operations as well using already written code. Anything else is a bit of a redefinition of the protocol. I'm not opposed to that if it's the right thing to do, but in this case I think the original definition given to me is probably best.

Dustin said long ago that technically speaking, a client need not receive streaming updates or a fast forward map in order to implement the protocol and handle transfers of vbuckets. Upon receiving a NOT_MY_VBUCKET the client should be able to go retrieve a new map, then try the new place. To make this reality, we'd need to either add the responses for a time, or add some kind of hangup, tied to a specific request, which the client could use to then reschedule/retry each subsequent request. I still think the simple wait for a while is probably the best.
Comment by Farshid Ghods (Inactive) [ 12/Jun/12 ]
can we keep the connection open for a few seconds after the last vbucket is rebalanced out?
Comment by Aleksey Kondratenko [ 12/Jun/12 ]
That's not going to solve this problem. Clients can be idle for minutes, then try to send something and then discover badness. And that discovery of badness can essentially happen in the middle, i.e. after the command was in fact executed.
Comment by Aleksey Kondratenko [ 12/Jun/12 ]
Assigning back to Dipti
Comment by Matt Ingenthron [ 02/Jul/12 ]
With respect to being "idle for minutes", clients receive configuration updates and should handle them pretty quickly. While I agree that staying up for a few seconds isn't a proper fix, it's probably a good enough workaround for most cases for right now.

Protocol wise we have a problem though, since there is no way to signal a polite hangup.
Comment by Aleksey Kondratenko [ 02/Jul/12 ]
We can obviously add some small delay. How much do you think would be enough?

Note however that we've discussed in the past that not all SDKs and deployments really have to pull streaming vbucket map updates; in that case things will fail.

For an idle connection, any new request will AFAIK currently cause bucket engine to silently drop the connection. I.e. bucket engine (AFAIK) does not drop all of them asap. I think it could send some new response code and then shut down the connection gently. That seems like a better fix, no?

Well, currently ns_server restarts memcached in a forceful way right after failover, and that will also cause connections to be dropped on the floor. We can deal with that by not exiting the old memcached but instead asking it to release as many resources as possible (including its listening ports) except currently established connections, so that this gentle hangup is possible.
Comment by Matt Ingenthron [ 02/Jul/12 ]
I do think as little as a few seconds would do it in most cases. Going high, a 10s delay may actually be good.

If it's in response to a request, something that indicates polite hangup and reconfigure would be great. I was mostly referring to the fact that in the current protocol, we have no room for unsolicited responses.

I'm a little less concerned in the case of a failover, but it would be good to make that better too. The main concern was that in the case of a graceful "remove" server, it'd be unexpected to have operations fail. The clients gracefully handle things in all situations except the dropped connections at the end, which is before it gets a final config update.
Comment by Aleksey Kondratenko [ 02/Jul/12 ]
My point was that we don't actually need unsolicited responses.

When a bucket gets deleted, bucket-engine/memcached don't really drop those connections. Instead, when any request arrives, _then_ bucket engine drops the connection.

Here's how typical operation is handled in bucket_engine:

/**
 * Implementation of the "get" function in the engine
 * specification. Look up the correct engine and call into the
 * underlying engine if the underlying engine is "running". Disconnect
 * the caller if the engine isn't "running" anymore.
 */
static ENGINE_ERROR_CODE bucket_get(ENGINE_HANDLE* handle,
                                    const void* cookie,
                                    item** itm,
                                    const void* key,
                                    const int nkey,
                                    uint16_t vbucket) {
    proxied_engine_handle_t *peh = get_engine_handle(handle, cookie);
    if (peh) {
        ENGINE_ERROR_CODE ret;
        ret = peh->pe.v1->get(peh->pe.v0, cookie, itm, key, nkey, vbucket);

        if (ret == ENGINE_SUCCESS) {
            TK(peh->topkeys, get_hits, key, nkey, get_current_time());
        } else if (ret == ENGINE_KEY_ENOENT) {
            TK(peh->topkeys, get_misses, key, nkey, get_current_time());
        }

        return ret;
    } else {
        return ENGINE_DISCONNECT;
    }
}
I.e. if get_engine_handle spots that the connection belongs to a dead bucket instance, it'll return NULL, which will normally (as in this case) cause the engine operation to return ENGINE_DISCONNECT.

Instead of just disconnecting we could actually do better by sending something back and then disconnecting. And that doesn't look like a massive change.
Comment by Matt Ingenthron [ 02/Jul/12 ]
Yep, I'd gotten that we don't need unsolicited responses, but that still depends on traffic coming through (to your original point about idling earlier). I think a delayed shutdown with a new response should handle almost all cases, though.
Comment by Dipti Borkar [ 03/Aug/12 ]
Alk, given this discussion seems like a delay (10s) would be the best approach moving forward. Assigning this back to you for 2.0.
Comment by Aleksey Kondratenko [ 03/Aug/12 ]
Trond, I think we can do better (at least on *nix; we need to work out things for Windows too).

When a node detects it was failed over, it sends that 'die!' command. Currently memcached simply does _exit.

What if we instead:

* find all connected sockets and unmark them as CLOEXEC
* exec into a special 'zombie' program that will thus inherit those sockets, but will close our listen sockets and free memory
* that zombie program will send back errors to any client request (even a partial one) it sees

Makes sense?

Or we can do with plan A which is delay 10 seconds before asking memcached to die.
Comment by Trond Norbye [ 06/Aug/12 ]

I don't think this is an easy task to perform, because at the time the input listener is running, all of the other threads in memcached are still running. That means the worker threads will service clients and we will accept new clients. We could extend the "normal" shutdown path, where we let all of the worker threads complete their current call into ep-engine and then terminate. At that point we know that no one else is touching the sockets from the frontend, but we still have all of the threads running in the backend.

The next problem would be "protocol synchronization" on the socket. According to the protocol specification the server can't send untagged messages to the client, so the only thing we may do is send an answer to a command. Some of the sockets may have partially written data on them, so we would need to figure out where in the message we are in order to decide what to send to the client (and then send only the not-my-vbucket response etc. back on new commands).

In addition there are the portability concerns, but I guess the easiest thing there would be to close the listening sockets and then notify ns_server via its stdout/stderr channel, with a special notice that the listen port should be available again, and let the current process keep on running.

I do think it would be easier to solve this from ns_server by extending it to set all of the vbuckets to "dead" on the node it is going to kill (this will cause memcached to send not-my-vbucket for all of the requests sent to that node). We could extend memcached with a command (over the stdin listener) that releases the ports it has bound, so that ns_server may execve a new one if it likes while it waits a few seconds before killing the old one...

Alk what do you think about that?

Comment by Aleksey Kondratenko [ 06/Aug/12 ]
Background threads would be killed by exec-ing or fork-ing+exec-ing. But indeed, I forgot about partially sent replies.

Deleting vbuckets is not going to be easier than just deleting buckets. Plus we would have to do something about the memory usage of the 'old' memcached instance, and its data dir usage too.

So plan A then. I'll initiate forced bucket shutdown and 10 seconds later ask for "die!"
Comment by Matt Ingenthron [ 06/Aug/12 ]
FWIW, plan A is good by me!
Comment by Aleksey Kondratenko [ 05/Oct/12 ]
Uploaded to gerrit.
Comment by Farshid Ghods (Inactive) [ 08/Oct/12 ]
this is now waiting on Andrei to fix the tests
Comment by Andrei Baranouski [ 22/Apr/13 ]
A 10-second delay has been added where necessary.
Generated at Fri Aug 22 05:36:42 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.