not_my_vbucket errors during rebalance
Hi,
I'm using the latest dalli gem for Ruby to connect to Membase. I created a 2-node cluster with one bucket (1 replica). When I remove a node from the cluster and rebalance, I often get "not_my_vbucket" errors, usually during the second half of the rebalance, and they don't go away until the rebalance finishes. The same thing happens when I add the node back to the cluster. Is this normal behavior? Do I need more than 2 nodes and more than 1 replica for a rebalance to work without errors?
Thanks,
Chris
Hi Chris, what version of Membase are you using, and what port are you using to connect to it?
Perry
1.6.5 using port 11211
At one point I also tried setting up a client-side Moxi gateway to connect to the cluster, but ran into the exact same issue.
I set up a quick test where I set a key and then continually called get, both from the dalli library and from telnet, while performing a rebalance. I started getting this error from dalli about 75% of the way into the rebalance, and it didn't go away until the rebalance finished. In my telnet session I never got any errors, and it always returned the value. This makes me suspect that the issue is with the binary protocol.
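For anyone who wants to reproduce this, here is a rough sketch of the kind of polling loop described above. The helper name poll_key, the key name, and the host are made up for illustration; the helper accepts any object that responds to get, so a Dalli::Client should slot in, but treat it as a starting point rather than the exact test that was run.

```ruby
# Hypothetical helper: repeatedly reads one key and tallies the outcomes.
# `client` is any object that responds to #get (for example a Dalli::Client).
def poll_key(client, key, expected, iterations: 100, interval: 0.1)
  tally = Hash.new(0)
  iterations.times do
    begin
      value = client.get(key)
      tally[value == expected ? :ok : :mismatch] += 1
    rescue StandardError => e
      # Group failures by message so a burst of one error type stands out.
      tally["error: #{e.message}"] += 1
    end
    sleep(interval) if interval > 0
  end
  tally
end

# Usage against a real cluster (needs the dalli gem; host and key are placeholders):
#   require 'dalli'
#   client = Dalli::Client.new('localhost:11211')
#   client.set('rebalance_probe', 'hello')
#   p poll_key(client, 'rebalance_probe', 'hello', iterations: 600, interval: 0.5)
```

Start the loop, kick off the rebalance, and compare the tally against a plain telnet get of the same key.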
Thanks for the info Chris, I'll have to do some more digging on our side. That's not supposed to be the behavior so I suspect there is a bug somewhere even though we've got plenty of customers doing just these operations.
I'll get back to you after some more investigation over here.
Perry
Just to go into a bit more detail, the "not my vbucket" error is actually considered "valid" during a rebalance since it is the signal to a client that a particular vbucket has moved from one server to another. However, Moxi (on port 11211) should be masking this error and automatically redirecting the traffic when necessary...that's the potential bug I'm looking into.
Hope this isn't blocking you too much.
Perry
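To illustrate the redirect behavior described above, here is a toy sketch of what a vbucket-aware client (or a proxy like Moxi) is expected to do when it sees "not my vbucket": refresh its vbucket map and retry against the new owner. Everything here is a simplified assumption for illustration: ToyNode and ToyClient are made-up classes, the 1024-vbucket count and the CRC32-based hash are assumed, and this is not Moxi's or dalli's actual code. The 0x0007 status value is the binary-protocol code for "not my vbucket".

```ruby
require 'zlib'

NOT_MY_VBUCKET = 0x0007 # binary-protocol status for "not my vbucket"
NUM_VBUCKETS   = 1024   # assumed vbucket count (power of two)

# Assumed hashing scheme: CRC32 of the key folded into the vbucket range.
def vbucket_for(key)
  (Zlib.crc32(key) >> 16) & (NUM_VBUCKETS - 1)
end

# Toy stand-in for a server node: it only answers for vbuckets it owns.
class ToyNode
  attr_reader :owned, :data

  def initialize(owned)
    @owned = owned
    @data = {}
  end

  def get(vbucket, key)
    return [NOT_MY_VBUCKET, nil] unless @owned.include?(vbucket)
    [0, @data[key]]
  end
end

# Toy client: on NOT_MY_VBUCKET it refreshes its map and retries once.
class ToyClient
  def initialize(nodes, &fetch_map)
    @nodes = nodes
    @fetch_map = fetch_map   # block returning the current vbucket -> node index map
    @map = fetch_map.call
  end

  def get(key)
    vb = vbucket_for(key)
    2.times do
      status, value = @nodes[@map[vb]].get(vb, key)
      return value unless status == NOT_MY_VBUCKET
      @map = @fetch_map.call # vbucket moved: pick up the new map, then retry
    end
    raise "not_my_vbucket persisted after map refresh"
  end
end
```

With a retry like this in place, the caller never sees the error as long as the refreshed map is current, which is the masking that port 11211 is supposed to provide.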
Hey Perry,
Thanks for looking into it. If there's any other information I can provide let me know. It should hopefully be fairly reproducible.
-Chris
Hey Chris, your analysis was spot-on and we discovered a bug with the binary protocol going through Moxi: http://jira.membase.org/browse/MB-3389
I'm working to get this fixed and will follow up with you as soon as I can.
Thanks again.
Perry
Hey Chris, we were able to fix this bug. Thanks again for your help.
You should be able to grab the latest source to test it, or I can make a build of Moxi available to you directly for verification. This fix will be included in an upcoming build as well.
Perry
Well, I tried adding another node to the cluster for a total of 3 nodes. I also recreated the bucket with 2 replicas. Unfortunately I still get the same error. It seems that once I remove the node the data resides on, I get this error temporarily until the rebalance finishes.