I am experimenting with a 3-node cluster to see what happens from a client perspective when failing over nodes and rebalancing a memcached bucket called “cache”.
For my test I created 1000 documents with known keys, then added a new node and initiated a rebalance. As soon as the rebalance is triggered, my code stops working and does not recover: every attempt to read a document simply returns a ClientFailure status, with no exception.
Looking at the logs, it appears that I get some kind of auth failure:
2015-08-27 15:38:15,391 [Read thread #1] DEBUG Couchbase.IO.ConnectionBase - Complete opaque72 on 4b7e3a42-2a17-46b4-9e09-896e8249e9af
2015-08-27 15:38:15,391 [Read thread #1] DEBUG Couchbase.Authentication.SASL.CramMd5Mechanism - Authentication for socket 4b7e3a42-2a17-46b4-9e09-896e8249e9af failed: Auth failure
2015-08-27 15:38:15,391 [Read thread #1] DEBUG Couchbase.IO.Strategies.DefaultIOStrategy - Could not authenticate cache using Couchbase.Authentication.SASL.CramMd5Mechanism - 0e8155b4-08b0-4094-923e-ce03fb5a0aba.
2015-08-27 15:38:15,391 [Read thread #1] DEBUG Couchbase.IO.Strategies.DefaultIOStrategy - System.Security.Authentication.AuthenticationException: cache
at Couchbase.IO.Strategies.DefaultIOStrategy.Authenticate(IConnection connection)
at Couchbase.IO.Strategies.DefaultIOStrategy.Execute[T](IOperation`1 operation)
Using Couchbase Server 3.1.0 and SDK 2.1.4, I tried to reproduce this with your code and could not. I started with a 3-node cluster, then added a node and rebalanced. I then removed a node and rebalanced, and did that again. Finally, I added both nodes back into the cluster and rebalanced.
I never encountered any authentication failure or ClientFailure status. What version of Couchbase Server were you using? Did the memcached bucket “cache” have a password?
The “cache” bucket does have a password. When I removed it, the rebalance worked as you described, and I did not get any errors either.
I also see some slightly strange behaviour in the web interface: ops per second stays at 0 during my load test, and only store operations cause it to increase. On a production cluster, ops per second is reported accurately (I assume). What could cause it to stay at 0? When debugging, I do get bytes back for the document I try to get, so there are no functional issues with the client or server.
I couldn’t replicate this issue with a memcached bucket and a password. I noticed that your code wasn’t passing the password into ClusterHelper.OpenBucket, so you may want to check this. I have closed the ticket for now, but if you can provide steps to reproduce, I’ll re-open it.
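For reference, here is a minimal sketch of opening a password-protected bucket and inspecting a failed read. It uses Cluster.OpenBucket rather than the ClusterHelper call mentioned above, and the server address, the password "secret" and the key "some-key" are placeholders I made up for illustration, not values from your setup.

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

class Program
{
    static void Main()
    {
        // Assumption: a cluster node is reachable on localhost; use a real node address.
        var config = new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://localhost:8091/pools") }
        };

        using (var cluster = new Cluster(config))
        // Assumption: "secret" stands in for the actual password of the "cache" bucket.
        using (var bucket = cluster.OpenBucket("cache", "secret"))
        {
            // "some-key" is a hypothetical document key.
            var result = bucket.Get<string>("some-key");
            if (!result.Success)
            {
                // Failures are reported on the result rather than thrown,
                // so log the status, the message and any captured exception.
                Console.WriteLine("{0}: {1}", result.Status, result.Message);
                if (result.Exception != null)
                {
                    Console.WriteLine(result.Exception);
                }
            }
        }
    }
}
```

Logging Status, Message and Exception on every failed result should make it easier to tell an authentication problem apart from other client-side failures during the rebalance.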
Also, the 0 ops in the Management Console may be a bug in the console; I noticed the same thing when doing Upserts, but after I switched to another tab and back, it was working.
About the 0 ops: I am not using Thread.Sleep at the moment, but I will see if I can do some more tests on this, maybe trying a new install or rewriting the client part.