Intermittent "Failed to initialize the pool" errors

I have lots of worker processes that get spawned to do work, and each one creates a CouchbaseClient with a minPoolSize of 6; the 4-node cluster has about 1,000 total connections to it. I'm using the .NET 1.3.11 client.

Recently I have been getting more and more "Failed to initialize the pool" errors. The call to the CouchbaseClient constructor actually returns after ~25 seconds without throwing an exception, but subsequent attempts to read from the CouchbaseClient object return "unable to locate node" errors. The Couchbase client logging shows the following:

Error Failed to initialize the pool. System.Net.Sockets.SocketException (0x80004005): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
at Couchbase.SocketPool.Create()
at Couchbase.SocketPool.PreAllocate(Int32 capacity)
at Couchbase.SocketPool..ctor(IMemcachedNode node, ISocketPoolConfiguration config, ISaslAuthenticationProvider provider)
at Couchbase.CouchbaseNode..ctor(IPEndPoint endpoint, Uri couchApiBase, ICouchbaseClientConfiguration config, ISaslAuthenticationProvider provider)
at Couchbase.CouchbasePool.CreateNode(IPEndPoint endpoint, ISaslAuthenticationProvider auth, Dictionary`2 nodeInfo)
at Couchbase.CouchbasePool.InitVBucket(ClusterConfig config, ISaslAuthenticationProvider auth)
at Couchbase.CouchbasePool.ReconfigurePool(ClusterConfig config)

Currently the process dies and retries because of these errors, but the 30+ second delay and the error logs filling up are not good. Is there any way to troubleshoot further what could be causing these connection failures? The cluster itself is clearly up, since the next try succeeds and other processes that maintain a connection don't get any errors.

I'm not sure if it's related, but I am also getting a lot of errors on get calls (which I retry) that look like:

System.ApplicationException: Failed to retrieve key: Exception reading response ---> System.IO.IOException: Unable to read data from the transport connection: A blocking operation was interrupted by a call to WSACancelBlockingCall. ---> System.Net.Sockets.SocketException: A blocking operation was interrupted by a call to WSACancelBlockingCall
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
at Couchbase.CouchbasePooledSocket.Read(Byte[] buffer, Int32 offset, Int32 count)
at Enyim.Caching.Memcached.Protocol.Binary.BinaryResponse.Read(IPooledSocket socket)
at Enyim.Caching.Memcached.Protocol.Binary.MultiGetOperation.ReadResponse(IPooledSocket socket)
at Couchbase.CouchbaseNode.Execute(IOperation op)

For the WSACancelBlockingCall errors, reviewing the client source I found that CouchbasePooledSocket logs when an operation times out, and going back through the detailed logs, each WSACancelBlockingCall was indeed preceded by an "Operation timeout." log entry. So in my socketPool configuration I doubled my min pool size and set the receive timeout to 60 seconds just to see if that makes those errors go away. The next question is then why Couchbase is taking so long to return data for certain calls.
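For reference, here is a sketch of what that socketPool change looks like in the client's app.config. Attribute names follow the 1.3-era Enyim/Couchbase configuration schema as I understand it, and the host name and pool sizes are illustrative, so double-check against your own config:

```xml
<couchbase>
  <servers bucket="default">
    <!-- hypothetical node URI; substitute your cluster's bootstrap node -->
    <add uri="http://node1:8091/pools" />
  </servers>
  <!-- Doubled minPoolSize (was 6) and raised receiveTimeout to 60s.
       Timeouts are TimeSpan strings (hh:mm:ss). -->
  <socketPool minPoolSize="12" maxPoolSize="20"
              connectionTimeout="00:00:10"
              receiveTimeout="00:01:00" />
</couchbase>
```

Note that raising receiveTimeout only hides the symptom (slow responses no longer abort the socket read); it doesn't explain why the server is slow to respond.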

@brandonagr -

Check out these threads WRT ephemeral port tuning:

and this one:


When analyzing the number of ephemeral ports in use on both the client and the Couchbase server, I see very low port usage. One thing that is different about my usage pattern is that I typically have relatively low traffic followed by a bulk get of tens of thousands of documents, so requests per second spike to over 200k during those large requests. I'm thinking there is a correlation between that heavy usage and other connections being delayed, but I'm not sure where to look; the CPU usage of the Couchbase process stays under 40% the whole time.
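In case anyone wants to check the same thing on a Windows client, these are the built-in commands I mean (standard Windows networking utilities; nothing Couchbase-specific):

```
:: Show the dynamic (ephemeral) TCP port range in use
netsh int ipv4 show dynamicport tcp

:: Rough count of established TCP connections
netstat -an | find /c "ESTABLISHED"
```

If the established-connection count is nowhere near the size of the ephemeral port range, port exhaustion is unlikely to be the cause, which is what I observed.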

One possible cause we just discovered was a misconfiguration of the cluster nodes: 3 of the 4 nodes were on an incorrect subnet, causing their network traffic to go through a much slower switch. So it could simply have been network congestion / packet loss causing the issue, which would fit with what I was seeing (high latency with no apparent cause on either end).