[JCBC-207] incorrect logic in reconnection threshold leads to never actually reconnecting
Created: 09/Jan/13  Updated: 31/Jan/13  Resolved: 31/Jan/13

Status: Closed
Project: Couchbase Java Client
Component/s: None
Affects Version/s: None
Fix Version/s: 1.1.0
Security Level: Public

Type: Bug
Priority: Major
Reporter: Matt Ingenthron
Assignee: Matt Ingenthron
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
depends on SPY-108 ensure remote server is truly availab... Resolved

In the CouchbaseConnectionFactory, the pastReconnThreshold() method doesn't correctly check the threshold time. It's using millis mixed with nanos.
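To illustrate the kind of unit mismatch described here, the following is a minimal sketch and not the actual CouchbaseConnectionFactory code. The class name, method names, and parameters are all hypothetical; it only assumes that elapsed time is measured in nanoseconds (as with System.nanoTime()) while the window is converted to milliseconds.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a millis/nanos mix-up; not the real client code.
final class ReconnThresholdSketch {

    // Buggy variant: the elapsed time is in nanoseconds, but the window is
    // converted to milliseconds, so the window appears to expire almost at once.
    static boolean withinWindowBuggy(long lastCheckNanos, long nowNanos, long windowSeconds) {
        long elapsedNanos = nowNanos - lastCheckNanos;
        return elapsedNanos < TimeUnit.SECONDS.toMillis(windowSeconds);
    }

    // Fixed variant: both sides of the comparison are in nanoseconds.
    static boolean withinWindowFixed(long lastCheckNanos, long nowNanos, long windowSeconds) {
        long elapsedNanos = nowNanos - lastCheckNanos;
        return elapsedNanos < TimeUnit.SECONDS.toNanos(windowSeconds);
    }
}
```

With a 10 second window and one second elapsed, the buggy comparison reports that the window has already passed, while the fixed one reports that it is still open. If a failure counter is reset whenever the window appears to have passed, a mismatch of this kind would keep the counter from ever reaching its threshold.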

Comment by Matt Ingenthron [ 09/Jan/13 ]
This is a regression of JCBC-19.
Comment by Matt Ingenthron [ 09/Jan/13 ]
It turns out this is not a regression. The way the test is being carried out is different in this case.

so, I worked out why this java failover isn't working. it's related to using kill -STOP

Here's the current behavior:
there's a per-node continuous operation timeout threshold
after a given node times out a bunch, the client will drop the connection to that node
then it'll try to reestablish it
meanwhile, there's another counter for how often we can't find an established connection to a node the config says we should be using
that second one, the algorithm is 10 failures to find the node in a 10 second window means re-bootstrap
so, the problem...
is that when we kill -STOP (instead of an actual cable pull)
you can still establish new connections to 11210
so, we drop and reestablish, send a bunch of stuff, then drop and reestablish quickly

this algorithm, which I'd tested with actual cable pulls, works for cable pulls, but it won't work (without big changes) in the SIGSTOP case
because we consider the connection "good" at the time it is established, not at the time of sending data
maybe that's incorrect to do
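
The "10 failures in a 10 second window" rule described above can be sketched as a sliding-window counter. This is only one possible reading of that rule, not the client's actual implementation; the class and method names are hypothetical, and timestamps are passed in as nanoseconds so the units stay consistent.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window failure counter; not the real client code.
final class RebootstrapWindowSketch {
    private final int maxFailures;
    private final long windowNanos;
    private final Deque<Long> failures = new ArrayDeque<>();

    RebootstrapWindowSketch(int maxFailures, long windowNanos) {
        this.maxFailures = maxFailures;
        this.windowNanos = windowNanos;
    }

    // Records one "node not found" failure at the given time and returns
    // true when enough failures fall inside the window to re-bootstrap.
    boolean recordFailure(long nowNanos) {
        // Discard failures that are older than the window.
        while (!failures.isEmpty() && nowNanos - failures.peekFirst() > windowNanos) {
            failures.pollFirst();
        }
        failures.addLast(nowNanos);
        if (failures.size() >= maxFailures) {
            failures.clear(); // start counting afresh after triggering
            return true;
        }
        return false;
    }
}
```

Ten failures spaced one second apart trigger a re-bootstrap on the tenth call, while failures spaced two seconds apart never do, because the older entries age out of the window first.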
Comment by Matt Ingenthron [ 09/Jan/13 ]
I think I've worked out an approach with Mark Nunberg's help.

We'll need to change spymemcached to verify that a connection actually works, by sending a noop, before calling it good. If the noop fails, the connection will go back to being reconnected. We may need backoff for this as well.
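
As a rough sketch of that approach (not the spymemcached change itself), the loop below treats a connection as good only after a noop round trip succeeds, and doubles the delay between attempts up to a cap. The class, the method, and the use of BooleanSupplier to stand in for "connect" and "noop" are all assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical verify-then-trust reconnect loop; not the spymemcached API.
final class VerifiedReconnectSketch {

    // Returns the attempt number on which both the connect and the noop
    // succeeded, or -1 if every attempt failed or the thread was interrupted.
    static int reconnectWithNoop(BooleanSupplier connect, BooleanSupplier noop,
                                 int maxAttempts, long baseDelayMillis, long maxDelayMillis) {
        long delay = baseDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // An established socket is not enough: require a noop reply too.
            if (connect.getAsBoolean() && noop.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
                // Exponential backoff, capped at maxDelayMillis.
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        return -1;
    }
}
```

In the kill -STOP case described earlier, connect would keep succeeding while noop keeps failing, so a node in that state would stay in the reconnect loop instead of being marked good.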
Comment by Michael Nitschinger [ 09/Jan/13 ]
Just as a note, the changesets I've pushed were tested against "freezing" a VM.
Generated at Thu Nov 27 19:59:14 CST 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.