Details
Description
In earlier tests with reconnecting to a node on failover we used default memcached bucket. But when we tested the same scenario with a non-default bucket, we noticed the client did not reconnect (due to a null pointer exception internally). I have attached the SDK logs for this scenario where we used "IndexByLniataData" memcached bucket. The problem presents when adding the node back after a failover.
11:34:43,411 DEBUG [Memcached IO over {MemcachedConnection to /10.14.5.119:11210}] [CouchbaseMemcachedConnection] Selecting with delay of 3038ms
Exception in thread "Thread-3" java.lang.NullPointerException
at net.spy.memcached.auth.AuthThread.buildOperation(AuthThread.java:117)
at net.spy.memcached.auth.AuthThread.run(AuthThread.java:86)
Logs/stack trace attached.
11:34:43,411 DEBUG [Memcached IO over {MemcachedConnection to /10.14.5.119:11210}] [CouchbaseMemcachedConnection] Selecting with delay of 3038ms
Exception in thread "Thread-3" java.lang.NullPointerException
at net.spy.memcached.auth.AuthThread.buildOperation(AuthThread.java:117)
at net.spy.memcached.auth.AuthThread.run(AuthThread.java:86)
Logs/stack trace attached.
There is a safeguard already in that the continuous timeout threshold will kick in and then the connection will be rebuilt. I don't know if this issue comes up all of the time, but assuming it's a rare event we'd see 1000 operations timeout (by default) followed by the connection being rebuilt.
We'd have to add some diagnostic information to the client and reliably reproduce this to identify the issue. I think the scenario is:
1) set up a cluster of say 3 nodes
2) configure a client, have it work with an authenticated memcached bucket on the cluster
3) faillover a node by clicking on "failover" in the console
4) add the node back by clicking on "add back"
Is this correct?