[CCBC-91] timeouts seen after failover, rebalance and add back
Created: 13/Aug/12  Updated: 13/Nov/12  Resolved: 18/Aug/12

Status: Closed
Project: Couchbase C client library libcouchbase
Component/s: None
Affects Version/s: 1.0.4
Fix Version/s: 1.0.5
Security Level: Public

Type: Bug
Priority: Blocker
Reporter: Matt Ingenthron
Assignee: Sergey Avseyev
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: PHP 5.3.3 (cli) (built: Jun 27 2012 12:25:48)
CentOS release 5.8 (Final), x86_64
Couchbase Server 1.8.1 Enterprise

Attachments: GZip Archive php-ext-couchbase.tar.gz    

1. Start a PHP client that sets and gets keys in a loop against a 2-node cluster
2. Click failover to kick one node out, then click rebalance so that the node is no longer associated with the cluster
3. Walk through the setup wizard on that node and re-add it to the cluster
4. After adding it, click rebalance

Expected behavior:
During rebalance in step 4, which is an add node scenario, no timeouts are expected.

Observed behavior:
During rebalance in step 4, I see timeouts from PHP, and they continue even after the rebalance has completed.

Comment by Matt Ingenthron [ 13/Aug/12 ]
A packet capture of this same issue, taken with the client on Mac OS X and CentOS 5.8 servers running Couchbase Server 1.8.1 Enterprise Edition, may be found at http://dl.dropbox.com/u/1537838/failover-maybe-issue
Comment by Sergey Avseyev [ 13/Aug/12 ]
Comment by Matt Ingenthron [ 13/Aug/12 ]
Note from discussion: this is a possible fix, but we are not sure yet.
Comment by Matt Ingenthron [ 13/Aug/12 ]
Sergey and I reproduced the issue, and it is tied to the series of steps outlined above. For some reason the underlying libcouchbase is not receiving the updated cluster configuration, so it sends items to the wrong node, and those operations then time out.

Sergey will do more work on finding the specific cause.
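
The failure mode described above (a client routing by an out-of-date configuration) can be illustrated with a toy model. This is only a sketch under stated assumptions, not libcouchbase code: the names `Cluster`, `Client`, `vbucket_for` and `count_timeouts` are hypothetical, the CRC32-modulo hash is a simplification, and a misrouted request is simply modelled as a timeout regardless of how a real server would respond.

```python
import zlib

NUM_VBUCKETS = 8  # illustrative value only


def vbucket_for(key):
    """Map a key to a vBucket index (simplified CRC32-modulo hashing)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


class Cluster:
    """Toy cluster: records which node currently owns each vBucket."""

    def __init__(self, owners):
        self.owners = list(owners)

    def handle(self, node, key):
        # Assumption: a request sent to a node that does not own the
        # key's vBucket never completes, which we report as a timeout.
        if self.owners[vbucket_for(key)] == node:
            return "ok"
        return "timeout"


class Client:
    """Toy client that routes requests using its own cached map."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.cached_map = list(cluster.owners)

    def refresh(self):
        # Stands in for receiving an updated cluster configuration.
        self.cached_map = list(self.cluster.owners)

    def get(self, key):
        node = self.cached_map[vbucket_for(key)]
        return self.cluster.handle(node, key)


def count_timeouts(client, keys):
    """Issue one get per key and count how many time out."""
    return sum(1 for key in keys if client.get(key) == "timeout")
```

In this model, changing `cluster.owners` (the rebalance) without calling `client.refresh()` leaves the client sending keys to the previous owner, and those requests keep timing out even after the rebalance is done, which matches the observed behavior in the report. Once `refresh()` is called, the timeouts stop.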
Comment by Sergey Avseyev [ 14/Aug/12 ]
The patch http://review.couchbase.org/19599, together with the aforementioned http://review.couchbase.org/19563, solves the issue.

To reproduce it reliably, fail over the node that the client is currently using to listen for configuration changes. (Usually this is the first node in the initial node list that the client connected to successfully.)
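
The reproduction hint above depends on which node the client uses as its configuration source. A minimal sketch of that selection logic follows, assuming the "first successful node in list order" rule from the comment. `ConfigListener` and `is_reachable` are hypothetical names used for illustration only; this is not the content of either patch and not the libcouchbase API.

```python
class ConfigListener:
    """Toy model of choosing the node to stream config updates from."""

    def __init__(self, nodes, is_reachable):
        self.nodes = list(nodes)          # initial node list, in order
        self.is_reachable = is_reachable  # callable: node -> bool (assumed)
        self.current = None

    def connect(self):
        """Attach to the first reachable node in list order."""
        for node in self.nodes:
            if self.is_reachable(node):
                self.current = node
                return node
        self.current = None
        return None

    def on_disconnect(self):
        """When the config stream drops, pick a source again."""
        return self.connect()
```

With two live nodes `n1` and `n2`, `connect()` settles on `n1`. Failing over `n1` is therefore the case that exercises the client's ability to move its config stream to `n2`; a client that never re-selects a source would keep its old configuration.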
Generated at Sun Sep 14 21:05:07 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.