Timeout errors in logs

We're seeing a bunch of errors like this in our logs - is there a timeout setting we should be adjusting?

Clients are using the PHP SDK, and some are using client-side moxi. As best we can tell, the errors correlate with failed connection attempts by clients. Everything is in the same AWS zone, and the Couchbase servers are provisioned with the extra IOPS setting. Normally we run a pair of enterprise servers, taking advantage of the free tier; right now I've thrown two more in there to see if it made a difference. It didn't, so I'm going to rebalance them out again tomorrow.

There are 5 buckets: 2 memcached and 3 Couchbase. The logs bucket, which seems to be the problem, gets mostly increment calls, about 100/sec (we keep a counter for each page/hour and increment hits).
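
To make the workload concrete, the per-page/per-hour counter scheme can be sketched roughly as follows. This is only an illustration: the key format, the function names, and the in-memory Counter standing in for the logs bucket are all assumptions, not the actual application code.

```python
from collections import Counter
from datetime import datetime


def hit_key(page: str, when: datetime) -> str:
    """Build one counter key per page per hour (hypothetical key format)."""
    return f"hits:{page}:{when.strftime('%Y%m%d%H')}"


def record_hit(store: Counter, page: str, when: datetime) -> int:
    """Increment the counter for this page/hour and return the new value."""
    key = hit_key(page, when)
    store[key] += 1
    return store[key]


# The Counter is a stand-in for the bucket; a real client would issue
# an increment call against the server instead.
store = Counter()
now = datetime(2013, 1, 1, 15, 30)
record_hit(store, "/home", now)
record_hit(store, "/home", now)
```

Every hit within the same hour lands on the same key, so a busy page concentrates its increments on a single counter until the hour rolls over.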

Server error during processing: ["web request failed",
{path,"/pools/default/bucketsStreaming/logs"},
{type,exit},
{what,
{timeout,
{gen_server,call,
[vbucket_map_mirror,
#Fun]}}},
{trace,
[{gen_server,call,2},
{menelaus_web_buckets,
build_bucket_node_infos,4},
{menelaus_web_buckets,build_bucket_info,5},
{menelaus_web,streaming_inner,3},
{menelaus_web,handle_streaming,4},
{request_throttler,do_request,3},
{menelaus_web,loop,3},
{mochiweb_http,headers,5}]}] (repeated 1 times)

2 Answers


Hello,

Yes, you can set the timeout for operations using the method call:
$object->setTimeout($timeout)

You can see the list of operations here:
http://www.couchbase.com/docs/couchbase-sdk-php-1.0/api-reference-summar...

Basic test is available here:
https://github.com/couchbase/php-ext-couchbase/blob/b4ceb4378459a67e71af...

This is related to this section in the developer guide:
http://www.couchbase.com/docs/couchbase-devguide-2.0/about-client-timeou...

Note that your errors are probably related to a cluster/network issue; setting a larger timeout will just "hide" the issue.

Regards
Tug
@tgrall


Solved. There was a bad machine in the cluster. No errors were reported at the AWS or system level, New Relic didn't report any issues, and the Couchbase errors logged were not associated with any one machine. However, when we rebalanced that machine out and replaced it with an identical one (from the same AMI), the error messages stopped.

Best guess is that the server Amazon allocated had some underlying connectivity issue (it was an m2.2xl in us-west-1c, if anybody cares).
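
For anyone chasing something similar: since the logged errors did not point at any one machine, one way to single out a flaky node is to probe each node directly from a client host and compare failure counts. Below is a minimal sketch under stated assumptions. The function names and the node list are hypothetical, and a plain TCP connect is only a rough proxy for what a real client does.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 1.0):
    """Try one TCP connect; return elapsed seconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


def failure_counts(nodes, attempts: int = 20, timeout: float = 1.0):
    """Count failed connects per (host, port) over several attempts."""
    counts = {}
    for host, port in nodes:
        results = [probe(host, port, timeout) for _ in range(attempts)]
        counts[(host, port)] = sum(r is None for r in results)
    return counts
```

Called as failure_counts([("10.0.0.1", 8091), ("10.0.0.2", 8091)]), with placeholder addresses and the Couchbase admin port, a node that fails noticeably more often than its peers is a candidate for rebalancing out.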