[JCBC-70] Client fails to reconnect to server of non-default memcached bucket after failover and add back
Created: 28/Jun/12  Updated: 30/Jan/13  Resolved: 30/Jan/13

Status: Resolved
Project: Couchbase Java Client
Component/s: Core
Affects Version/s: 1.0.3
Fix Version/s: 1.1.2
Security Level: Public

Type: Bug
Priority: Blocker
Reporter: Perry Krug
Assignee: Michael Nitschinger
Resolution: Duplicate
Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive sc_1_plabq11.dev.sabre.com.out.gz    
Issue Links:
Dependency
depends on SPY-102 Ensure all nodes are in the list of n... Resolved
Duplicate
is duplicated by SPY-111 Assertion/NPE Exception when trying t... Resolved

 Description   
In earlier tests of reconnecting to a node on failover, we used the default memcached bucket. But when we tested the same scenario with a non-default bucket, we noticed that the client did not reconnect (due to an internal null pointer exception). I have attached the SDK logs for this scenario, in which we used the "IndexByLniataData" memcached bucket. The problem appears when adding the node back after a failover.
  
11:34:43,411 DEBUG [Memcached IO over {MemcachedConnection to /10.14.5.119:11210}] [CouchbaseMemcachedConnection] Selecting with delay of 3038ms
Exception in thread "Thread-3" java.lang.NullPointerException
        at net.spy.memcached.auth.AuthThread.buildOperation(AuthThread.java:117)
        at net.spy.memcached.auth.AuthThread.run(AuthThread.java:86)

Logs/stack trace attached.

 Comments   
Comment by Matt Ingenthron [ 22/Aug/12 ]
I've spent a bit of time analyzing this issue, and it's not clear what the cause is. It is correct though that this would cause the auth thread to die, and as such authentication to the node would never complete.
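
To illustrate the consequence only (not the cause, which is still unknown), here is a minimal sketch of how an uncaught NullPointerException ends a worker thread so that its completion signal is never raised. The class name, the `mechanism` parameter, and the latch are all hypothetical and are not taken from net.spy.memcached.auth.AuthThread.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: none of these names come from the spymemcached source.
class AuthThreadDeathDemo {

    /** Runs a stand-in "auth" thread; returns true if it signalled completion. */
    static boolean runAuth(final String mechanism) {
        final CountDownLatch authComplete = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            // Throws NullPointerException when mechanism is null, which
            // terminates this thread before countDown() is reached.
            if (mechanism.length() > 0) {
                authComplete.countDown();
            }
        }, "auth-sketch");
        // Swallow the uncaught exception so the demo stays quiet; by default
        // the JVM prints 'Exception in thread "..."' and the thread still dies.
        worker.setUncaughtExceptionHandler((thread, error) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        // The thread has exited either way; only a clean run released the latch.
        return authComplete.getCount() == 0;
    }
}
```

In this sketch `AuthThreadDeathDemo.runAuth("PLAIN")` reports completion, while `AuthThreadDeathDemo.runAuth(null)` does not, and nothing retries the work once the thread has exited.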

There is already a safeguard: the continuous timeout threshold will kick in, and then the connection will be rebuilt. I don't know if this issue comes up all of the time, but assuming it's a rare event, we'd see 1000 operations time out (by default), followed by the connection being rebuilt.
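
A rough sketch of that safeguard, as I understand it: count consecutive timeouts, reset on any success, and rebuild once the count reaches the threshold. The class and method names are hypothetical, and the exact comparison and reset behaviour in the real client are assumptions here.

```java
// Hypothetical sketch: class and method names are illustrative and are not
// taken from the Couchbase or spymemcached source.
class ContinuousTimeoutTracker {
    private final int threshold;      // the comment above cites 1000 by default
    private int consecutiveTimeouts;  // reset by any successful operation
    private int rebuildCount;

    ContinuousTimeoutTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Records one operation outcome; returns true if it triggered a rebuild. */
    synchronized boolean record(boolean timedOut) {
        if (!timedOut) {
            consecutiveTimeouts = 0;
            return false;
        }
        consecutiveTimeouts++;
        if (consecutiveTimeouts < threshold) {
            return false;
        }
        // Threshold reached: a real client would queue a reconnect here.
        consecutiveTimeouts = 0;
        rebuildCount++;
        return true;
    }

    synchronized int getRebuildCount() {
        return rebuildCount;
    }
}
```

With a threshold of 3, two timeouts followed by a success reset the counter, and only three timeouts in a row trigger a rebuild. This is why a node stuck in the unauthenticated state would fail many operations before the connection recovers.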

We'd have to add some diagnostic information to the client and reliably reproduce this to identify the issue. I think the scenario is:
1) set up a cluster of say 3 nodes
2) configure a client, have it work with an authenticated memcached bucket on the cluster
3) fail over a node by clicking on "failover" in the console
4) add the node back by clicking on "add back"

Is this correct?
Comment by Perry Krug [ 23/Aug/12 ]
That appears correct. The customer has been able to reliably reproduce this, but since so much time has passed I would be hesitant to go back to them unless necessary...
Comment by Matt Ingenthron [ 09/Jan/13 ]
There is an open changeset for this. Please determine whether it is correct and needs to go in.
Comment by Michael Nitschinger [ 30/Jan/13 ]
Duplicate of SPY-111