[JCBC-207] incorrect logic in reconnection threshold leads to never actually reconnecting Created: 09/Jan/13 Updated: 31/Jan/13 Resolved: 31/Jan/13 |
|
| Status: | Closed |
| Project: | Couchbase Java Client |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 1.1.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Ingenthron | Assignee: | Matt Ingenthron |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Description |
|
In the CouchbaseConnectionFactory, the pastReconnThreshold() method doesn't correctly check the threshold time. It's using millis mixed with nanos.
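
As an illustration of the unit problem, here is a minimal sketch of a threshold check that keeps every timestamp in nanoseconds. It is not the actual CouchbaseConnectionFactory code: the class name `ReconnectThreshold`, its methods, and the "time since first failure" semantics are assumptions made for the example. The point is only that the configured threshold is converted once, with `TimeUnit`, into the same unit as the clock readings it is compared against.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not the client's real implementation. */
final class ReconnectThreshold {
    private final long thresholdNanos;
    private long firstFailureNanos;
    private boolean tracking;

    ReconnectThreshold(long threshold, TimeUnit unit) {
        // Convert once so every later comparison is nanos against nanos.
        this.thresholdNanos = unit.toNanos(threshold);
    }

    /** Remembers the time of the first failure in the current streak. */
    void recordFailure(long nowNanos) {
        if (!tracking) {
            firstFailureNanos = nowNanos;
            tracking = true;
        }
    }

    /** Clears the streak, e.g. after a successful operation. */
    void reset() {
        tracking = false;
    }

    /** True once the failure streak has lasted at least the threshold. */
    boolean pastThreshold(long nowNanos) {
        // Subtraction keeps the check valid even if System.nanoTime() wraps.
        return tracking && nowNanos - firstFailureNanos >= thresholdNanos;
    }
}
```

In use, callers would pass `System.nanoTime()` for every `nowNanos` argument; passing `System.currentTimeMillis()` to one of them would reintroduce the mixed-unit comparison described above.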
|
| Comments |
| Comment by Matt Ingenthron [ 09/Jan/13 ] |
|
This is a regression of |
| Comment by Matt Ingenthron [ 09/Jan/13 ] |
|
It turns out this is not a regression. The way the test is being carried out is different in this case.
So, I worked out why this Java failover isn't working. It's related to using kill -STOP.
Here's the current behavior. There's a per-node continuous operation timeout threshold: after a given node times out a bunch, the client will drop the connection to that node, then it'll try to reestablish it. Meanwhile, there's another counter for how often we can't find an established connection to a node the config says we should be using. For that second one, the algorithm is that 10 failures to find the node in a 10 second window means re-bootstrap.
So, the problem is that when we kill -STOP (instead of an actual cable pull), you can still establish new connections to 11210. So we drop and reestablish, send a bunch of stuff, then drop and reestablish quickly. This algorithm, which I'd tested with actual cable pulls, will work with actual cable pulls, but it won't work (without big changes) in the SIGSTOP case, because we consider the connection "good" at the time it is established, not at the time of sending data. Maybe that's incorrect to do. |
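
The second counter described above ("10 failures to find the node in a 10 second window means re-bootstrap") can be sketched as a sliding-window counter. This is only a sketch under assumptions: the class name `MissingNodeWindow` and its method are hypothetical, and the real client may count and expire failures differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window counter; not the client's real implementation. */
final class MissingNodeWindow {
    private final int maxFailures;
    private final long windowMillis;
    private final Deque<Long> failures = new ArrayDeque<>();

    MissingNodeWindow(int maxFailures, long windowMillis) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one "no established connection for this node" failure.
     * Returns true when enough failures fall inside the window that a
     * re-bootstrap should be triggered; the window is then cleared.
     */
    boolean recordFailure(long nowMillis) {
        failures.addLast(nowMillis);
        // Drop failures that are older than the window.
        while (!failures.isEmpty() && nowMillis - failures.peekFirst() > windowMillis) {
            failures.removeFirst();
        }
        if (failures.size() >= maxFailures) {
            failures.clear();
            return true;
        }
        return false;
    }
}
```

With `new MissingNodeWindow(10, 10_000)`, a node that accepts connections again between failures (as in the kill -STOP case) never reaches this path at all, which is why the window alone does not trigger a re-bootstrap there.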
| Comment by Matt Ingenthron [ 09/Jan/13 ] |
|
I think I've worked out an approach with Mark Nunberg's help.
We'll need to change spymemcached to verify the connection is actually good with a noop before calling it good. If it fails that check, it'll go back to being reconnected. We may need backoff for this as well. |
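
The approach above (verify with a noop before calling the connection good, retry with backoff otherwise) could look roughly like the following. This is a sketch, not spymemcached code: `VerifiedReconnect`, `connectVerified`, and `backoffMillis` are hypothetical names, the `Callable` stands in for "open a connection and send a noop", and capped exponential backoff is one assumed choice of backoff policy.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch; spymemcached's real reconnect logic is not shown here. */
final class VerifiedReconnect {

    /** Capped exponential backoff: base, 2*base, 4*base, ... up to maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = Math.min(baseMillis, maxMillis);
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay = delay > maxMillis / 2 ? maxMillis : delay * 2;
        }
        return delay;
    }

    /**
     * Calls connectAndNoop until it reports a verified connection.
     * Returns the number of attempts used, or -1 if none succeeded
     * (or the thread was interrupted while backing off).
     */
    static int connectVerified(Callable<Boolean> connectAndNoop,
                               int maxAttempts, long baseMillis, long maxMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            boolean ok;
            try {
                // Assumed to connect AND get a noop response before returning true.
                ok = connectAndNoop.call();
            } catch (Exception e) {
                ok = false; // treat any failure as "connection not good"
            }
            if (ok) {
                return attempt + 1;
            }
            try {
                Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

The design point is that a connection only counts as established after a round trip succeeds, so a stopped process that still accepts TCP connections keeps failing verification instead of resetting the failure counters.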
| Comment by Michael Nitschinger [ 09/Jan/13 ] |
| Just as a note, the changesets I've pushed were tested against "freezing" a VM. |