[CCBC-64] PHP Client reports Time Out errors on simple SET and GET operations when Rebalance is in Progress Created: 08/May/12  Updated: 13/Nov/12  Resolved: 07/Jun/12

Status: Closed
Project: Couchbase C client library libcouchbase
Component/s: library
Affects Version/s: 1.0.2
Fix Version/s: 1.0.4
Security Level: Public

Type: Bug Priority: Major
Reporter: Hari Subramaniam (Inactive) Assignee: Jan Lehnardt (Inactive)
Resolution: Fixed Votes: 0
Labels: customer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Centos5x OR Ubuntu 11x running Couchbase 1.8.0. 3 Node Cluster, plenty of available RAM and DISK. PHP 5.3. libcouchbase version 1.0.3-1

Attachments: PDF File report.pdf     File test_worker.php    

 Description   
This bug was reported by a customer. Whenever they start a rebalance (after removing a node from the cluster or adding a new node), client operations (PHP, sample code attached) time out with the error below:

PHP Warning: Couchbase::set(): Failed to store a value to server: Operation timed out in /root/test_worker.php on line <n>

This issue is consistent for both the customer as well as our in-house reproduction in Support. There are no errors on the server side. Rebalance does complete. And once rebalance is complete, the same script runs fine without any errors.

 Comments   
Comment by Matt Ingenthron [ 09/May/12 ]
PHP Client Library Version installed?
libcouchbase version installed?

CentOS Version?

Also needed is a description of what kind of rebalance is happening here. A "failover" click followed by a "rebalance" is different from a "remove node" followed by a "rebalance". Please outline the specific steps you used to reproduce.
Comment by Matt Ingenthron [ 09/May/12 ]
I do see the PHP client version info, sorry I'd missed that. I do still need the libcouchbase version info though.

Initial investigation shows this is likely an issue in libcouchbase, but we need more about the type of rebalance being done. Please comment.
Comment by Hari Subramaniam (Inactive) [ 09/May/12 ]
Updated Environment with OS versions and libcouchbase versions.
Comment by Matt Ingenthron [ 09/May/12 ]
After some investigation with Sergey, it seems that during the 2.5-second period we allow for retrying an operation after a not-my-vbucket response, we retry constantly, but the operation still times out, so we pass the timeout error up to the PHP client.

To verify this, Sergey set the timeout to 10 seconds, and verified that the issue could not be seen. This is abnormally high though and would indicate that there is something wrong at the server side. There is no reason we should see more than 2.5 seconds for a vbucket to transfer from one node to another during rebalance.
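
The retry window described above can be sketched as follows. This is a minimal illustration, not libcouchbase code: the names `retry_until_deadline`, `send`, `SUCCESS`, `NOT_MY_VBUCKET` and `OperationTimedOut` are hypothetical, and the clock is injectable only so the behaviour can be exercised without real waiting.

```python
import time

# Hypothetical status values standing in for the server's responses.
SUCCESS = "success"
NOT_MY_VBUCKET = "not_my_vbucket"


class OperationTimedOut(Exception):
    """Raised when no attempt succeeds before the deadline."""


def retry_until_deadline(send, timeout=2.5, clock=time.monotonic):
    """Call send() repeatedly until it returns SUCCESS or the timeout elapses.

    send    -- callable that performs one attempt and returns a status
    timeout -- retry window in seconds (2.5 is the value quoted in this ticket)
    clock   -- callable returning the current time in seconds
    Returns the number of attempts that were needed.
    """
    deadline = clock() + timeout
    attempts = 0
    while clock() < deadline:
        attempts += 1
        if send() == SUCCESS:
            return attempts
        # Any other status (e.g. NOT_MY_VBUCKET) falls through to another try.
    raise OperationTimedOut("Operation timed out after %d attempts" % attempts)
```

With a window of 2.5 seconds, an operation whose retries only start succeeding after roughly 4 seconds surfaces as a timeout, while a 10-second window lets the same operation through, which matches the observation above.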

For another level of verification, we'll check that the client is trying both the place the config states the vbucket is active and the place the fast-forward (ffwd) map states the vbucket is going. If this shows libcouchbase is behaving correctly, then the problem is at the server side, not the client side. It's possible there's a bug in the server here that moxi would mask with its less sophisticated retry algorithm.
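
As a rough sketch of the check described above, the function below lists the servers a retry should consider for a vbucket: the one the current config marks as active, then the one the fast-forward map says it is moving to. The function name and the list-of-server-indexes representation are assumptions made for illustration and are not the libcouchbase API.

```python
def candidate_servers(vbucket, vbucket_map, fast_forward_map=None):
    """Return server indexes to try for a vbucket, in order of preference.

    vbucket_map      -- list where entry i is the server currently active
                        for vbucket i (hypothetical representation)
    fast_forward_map -- same shape, describing where each vbucket will live
                        after the rebalance; None when no rebalance is running
    """
    current = vbucket_map[vbucket]
    candidates = [current]
    if fast_forward_map is not None:
        future = fast_forward_map[vbucket]
        # Only add the destination when the vbucket is actually moving.
        if future != current:
            candidates.append(future)
    return candidates
```

A client that only ever retried against the first entry would keep receiving not-my-vbucket once the vbucket had moved, so verifying that both entries are tried separates a client-side bug from a server-side one.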

It's equally possible that there is a libcouchbase configuration update problem. This next level of verification should tell us for sure.

Note, PHP does not currently give the user the ability to raise the timeout to 10 seconds, but I don't think that's the right solution here. If a single vbucket transfer is taking longer than that, it needs to be addressed at the server side.
Comment by Sergey Avseyev [ 10/May/12 ]
I've analyzed the packets sent to the server and found that, after a NOT_MY_VBUCKET error, libcouchbase doesn't send the corrected packet out to the network in time. It copies the packet into the internal output buffer, but the packet doesn't hit the network. With an increased timeout, however, it works. So the issue is in the libcouchbase layer and isn't PHP-specific. The complete rebalance log can be found here: http://files.avsej.net/add-node-rebalance.dump (~200M)
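
The symptom described above, a packet sitting in an output buffer without reaching the network, can be illustrated with a toy model. This is not the libcouchbase implementation and not the actual fix; the `Connection` class, its fields and the `schedule_write` flag are hypothetical and only show how queuing data without asking the event loop for a write leaves it unsent.

```python
class Connection:
    """Toy connection with an output buffer drained by an event loop."""

    def __init__(self):
        self.output = bytearray()   # bytes waiting to be written
        self.wire = bytearray()     # bytes that reached the "network"
        self.want_write = False     # whether a write event was requested

    def queue(self, packet, schedule_write=True):
        """Copy a packet into the output buffer.

        With schedule_write=False the packet is buffered but the event loop
        is never told there is something to send.
        """
        self.output += packet
        if schedule_write:
            self.want_write = True

    def run_event_loop_once(self):
        """Flush the buffer only if a write event was requested."""
        if self.want_write and self.output:
            self.wire += self.output
            self.output.clear()
            self.want_write = False
```

In this model a retried packet queued with `schedule_write=False` stays in `output` until some later operation happens to request a write, which is one way a retry could miss a short timeout yet succeed under a longer one.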
Comment by Matt Ingenthron [ 10/May/12 ]
A fix has been produced and reviewed: http://review.couchbase.org/#change,15882

We will send the fix to support so that either support or the customer can verify it, and will then include it in the next patch update.
Comment by Sergey Avseyev [ 15/May/12 ]
The patch was merged.
Comment by Hari Subramaniam (Inactive) [ 21/May/12 ]
Fix has been verified by the customer. Considering this bug as addressed.
Generated at Wed Aug 20 11:54:29 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.