ClientFailure when calling Get() concurrently

Hi,
I’m constantly getting “ClientFailure” when calling Bucket.Get(aKeys).
The verbose message says:
“Cannot access a disposed object.\r\nObject name: ‘System.Net.Sockets.NetworkStream’.”

I’m accessing the bucket from several threads in parallel.
I keep the Cluster and the Buckets open and static, without disposing them at any time (as instructed).

Wrapping the Get() call in a lock(this) doesn’t help.

Please advise.
Itay

@itay -

Somehow the connection is getting closed; this could be on the server side or the client side. If you can post logs, that would help.

Note that if you are using the overload of IBucket.Get which takes an IList (the bulk overloads), then you probably don’t want to be calling it within a parallel loop, because internally the client is already using the TPL; it’s simply too much parallelism. If you’re using regular threads (Thread.Start), then I would use one bucket per thread as opposed to sharing a bucket between threads.
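
Something along these lines (a minimal sketch; the bucket name, keys, and thread count below are placeholders, not recommendations):

```csharp
using System.Collections.Generic;
using System.Threading;
using Couchbase;
using Couchbase.Configuration.Client;

class OneBucketPerThread
{
    static void Main()
    {
        // One shared, long-lived Cluster instance for the process.
        var cluster = new Cluster(new ClientConfiguration());

        for (var i = 0; i < 4; i++) // thread count is arbitrary here
        {
            new Thread(() =>
            {
                // Each thread opens its own bucket reference instead of
                // sharing a single IBucket instance across threads.
                using (var bucket = cluster.OpenBucket("default"))
                {
                    // The bulk overload already parallelizes internally via
                    // the TPL, so don't also wrap it in Parallel.For/ForEach.
                    var keys = new List<string> { "key1", "key2" };
                    var results = bucket.Get<string>(keys);
                }
            }).Start();
        }
    }
}
```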

-Jeff

@jmorris

I am using batch Get. I am not using a parallel loop but Thread.Start.
Assuming I have 5 buckets and 10 threads, using a different bucket per thread actually means opening 50 buckets on start. At 5 seconds to open a bucket from localhost, that’s over 4 minutes just to start.

Anyway, that doesn’t make sense, as calling Get and Update should be thread-safe AFAIK. Besides, web apps have to handle many concurrent operations per second from different users (= different threads = many threads).

On top of that, things that worked great before are now returning many ClientFailure and OperationTimeout errors. It happens even with a single thread, a single bucket, and a single POCO type.

Update 1: A batch update with 10 docs usually fails for 3-6 of them. Some are saved, some are not, and the failed docs are different every time.

Update 2: When the batch is in the hundreds of docs, Bucket.Upsert(items) hangs. It worked perfectly before, both from local and from remote.
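
For reference, here’s a simplified sketch of the batch upsert and how I check which docs failed (the bucket name and doc contents are placeholders):

```csharp
using System;
using System.Collections.Generic;
using Couchbase;

class BatchUpsertCheck
{
    static void Main()
    {
        var cluster = new Cluster();
        using (var bucket = cluster.OpenBucket("default")) // placeholder bucket name
        {
            var items = new Dictionary<string, object>
            {
                { "doc::1", new { Name = "a" } },
                { "doc::2", new { Name = "b" } }
            };

            // The bulk Upsert returns a result per key; inspect each one
            // to see which docs failed and why.
            var results = bucket.Upsert(items);
            foreach (var kv in results)
            {
                if (!kv.Value.Success)
                {
                    Console.WriteLine("{0} failed: {1} ({2})",
                        kv.Key, kv.Value.Status, kv.Value.Message);
                }
            }
        }
    }
}
```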

I’m still investigating why.

After a night’s sleep, I looked at the log file and found the following entry for a failed doc:

2015-01-06 08:38:16,709 [16] DEBUG Couchbase.CouchbaseBucket - Operation retry 0 for key xxx. Reason: VBucketBelongsToAnotherServer

I checked the console to see if a rebalance was needed but found no indication.

@itay -

Have you added or removed a node? That is typically when you get NMVs.

-Jeff

I didn’t add or remove any nodes, but I did restart them (before the incident).
Should that break a cluster?

I eventually fixed it, I hope, by gracefully failing over a node, but this only got me to the next error:

Couchbase.IO.ConnectionPool`1 - No connections currently available on x.x.x.x:11210

I’m opening a new thread.

@jmorris,

It happened again :sleepy:

A specific doc is now inaccessible.

Digging through the logs, I found the same scenario: Operation retry 0 for key xxx. Reason: VBucketBelongsToAnotherServer.

I restarted the servers (2 servers, 3.0.1 Community, with 1 replica), but it didn’t help.
Deleting this specific doc and re-inserting it doesn’t help either.
BTW: the doc is easily accessible from the console.

I cannot remove a node every time the server mixes up vBuckets.
What can I do to recover (can I manually rebalance?), and what should I do to prevent it from happening again?

P.S. Reading from a replica doesn’t help either.

Update: Gracefully failing over one of the nodes and then doing a “full recovery” fixed it.

Itay

@Itay -

Yes, you have to rebalance after adding, removing or failing over a node. The rebalance will distribute the keys equally across the nodes and update the cluster map and vbucket mappings on the client so that the client can retrieve them.

There is a bug in 2.0.3 that affects this replica read issue; it will be fixed in 2.1.0.

Glad to hear you resolved it! :smile:

-Jeff

Jeff,

What can I do programmatically to fix it in real time, as it is a showstopper event?

It happened again! :rage:

The specific doc is accessible from the console.
Its JSON is valid.
Get() returns ClientFailure, with VBucketBelongsToAnotherServer in the log4net output.

I tried again to fail over a node and then do a “full recovery”. After an hour, it is still not working.

Now the log4net output also says:
Couchbase.Authentication.SASL.CramMd5Mechanism - Authenticating socket xxx
Couchbase.Authentication.SASL.CramMd5Mechanism - Authentication for socket xxx failed: Auth failure
Couchbase.IO.Strategies.DefaultIOStrategy - Could not authenticate aaa using Couchbase.Authentication.SASL.CramMd5Mechanism - xxx.

I don’t understand the authentication issue, as an adjacent doc can be read successfully.

I’m terrified about using Couchbase on a production system :anguished:

What am I doing wrong?

@itay -

I can’t be sure exactly what is going on here; probably the best course of action would be to create an NCBC (a ticket in the .NET client’s issue tracker) and provide a list of steps to reproduce and a sample project.

This doesn’t exactly make sense to me; an NMV is always a server error and never a ClientFailure. ClientFailures are errors where the client cannot receive a response from the server or where a serialization/deserialization error was raised; the error manifested in the client and is not a server response.

With NMVs, you should see them propagate to OperationTimedOut if the client cannot resolve the NMV for a Get. Prepend and Append may return an NMV, but this is changing in 2.1.0 (soon to be released).
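
You can tell these cases apart by inspecting the operation result’s status; a rough sketch (the bucket name and key are placeholders):

```csharp
using System;
using Couchbase;
using Couchbase.IO;

class CheckStatus
{
    static void Main()
    {
        var cluster = new Cluster();
        using (var bucket = cluster.OpenBucket("default")) // placeholder name
        {
            var result = bucket.Get<string>("somekey"); // placeholder key
            if (!result.Success)
            {
                // ResponseStatus distinguishes client-side failures from
                // server responses such as VBucketBelongsToAnotherServer.
                switch (result.Status)
                {
                    case ResponseStatus.ClientFailure:
                        Console.WriteLine("Client-side error: " + result.Message);
                        break;
                    case ResponseStatus.VBucketBelongsToAnotherServer:
                        Console.WriteLine("NMV - the client's cluster map is stale.");
                        break;
                    case ResponseStatus.OperationTimedOut:
                        Console.WriteLine("Timed out (possibly an unresolved NMV).");
                        break;
                }
            }
        }
    }
}
```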

The authentication failure can mean a few things (see the sketch after this list):
a) You provided the wrong credentials
b) The bucket does not exist on the server
c) The bucket was just created and has not been completely initialized on the cluster (this takes a few seconds, so if you are programmatically creating buckets, operations may fail in the short term unless you delay).
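
For (a), the bucket credentials come from the client configuration; a minimal sketch, assuming a SASL-protected bucket (the URI, bucket name, and password are placeholders):

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

class BucketAuth
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://127.0.0.1:8091/pools") },
            BucketConfigs = new Dictionary<string, BucketConfiguration>
            {
                {
                    "default", new BucketConfiguration
                    {
                        BucketName = "default",
                        Password = "secret" // must match the bucket's SASL password
                    }
                }
            }
        };

        var cluster = new Cluster(config);
        var bucket = cluster.OpenBucket("default");
    }
}
```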

In general, you don’t want to be swapping nodes in and out of a cluster willy-nilly, since it’s a fairly intensive process. It’s very easy to do in Couchbase, but it should be reserved as an operational task performed when load is not at its peak (if possible). I would set up my cluster and then just leave it alone, unless you must change it for some operational reason.

I am not sure; I create, tear down, rebalance, etc. every day while developing and load testing, and in general things work as expected. Let’s get the NCBC going and take it from there.

Don’t get discouraged :smile:

-Jeff

Hi @jmorris,

Thanks for your effort to help me solve this issue.

  1. What’s NMV? :confused:
  2. Usually ClientFailure does mean a deserialization error that I eventually find and fix. In this scenario, it is possible that the data was not returned, or was returned null or corrupted, and thus caused a deserialization error. I can also say that the doc’s JSON integrity is good and that the doc is accessible through the console.
  3. These authentication messages occur only for that specific doc, while the same connection easily returns other docs in the same bucket; hence the bucket, connection, credentials, etc. are working properly.
  4. Usually removing a node and then restoring it takes an hour of work from me and about the same downtime for the system. So, no, I don’t want to do that at all. Anyway, it doesn’t always help.
  5. Waiting anxiously for 2.1.0 :stuck_out_tongue:
  6. Also waiting for the next community version :stuck_out_tongue_winking_eye:
  7. I’m trying to hold in there, but it is very challenging :cold_sweat:
  8. I thought of another reason that might cause this issue: I’m using 2 web servers concurrently with the cluster, so perhaps there is a collision. Check this

@itay

1-NMV stands for “Not My VBucket”.
2-It’s possible; you should be able to trace this back in the logs. I highly doubt the server would be the cause of any null, empty or corrupted data…I haven’t heard of any such thing.
3-The connection to a bucket is authenticated, not the data going across the connection, so one doc couldn’t succeed while another fails. Also, the authentication occurs when the connection is created, not when it’s used.

8-You can open literally thousands of clients (var cluster = new Cluster()) to a Couchbase server cluster; however, the best practice is to use the bare minimum necessary. Running two web applications (separate processes, thus separate client instances) is fine. Note that each client instance (new Cluster()) and each bucket you open will create a pool of TCP connections. This is controlled by the ClientConfiguration.PoolConfiguration.MaxSize and MinSize properties.
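
For example (a sketch; the sizes shown are illustrative, not recommendations):

```csharp
using Couchbase;
using Couchbase.Configuration.Client;

class PoolSizing
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            // Each opened bucket gets its own TCP connection pool,
            // bounded by these values.
            PoolConfiguration = new PoolConfiguration
            {
                MinSize = 1,
                MaxSize = 10
            }
        };

        var cluster = new Cluster(config);
        var bucket = cluster.OpenBucket("default"); // placeholder bucket name
    }
}
```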

If you are using a single Cluster instance (ClusterHelper will ensure this), you shouldn’t run into any issues. Even if you are using two or more Cluster instances per process, it shouldn’t be a problem.
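
A minimal sketch of the ClusterHelper pattern (the bucket name is a placeholder):

```csharp
using Couchbase;
using Couchbase.Configuration.Client;

class SingleClusterInstance
{
    static void Main()
    {
        // Initialize once at application startup (e.g. Application_Start).
        ClusterHelper.Initialize(new ClientConfiguration());

        // ClusterHelper.Get() returns the same Cluster instance everywhere.
        var cluster = ClusterHelper.Get();
        var bucket = cluster.OpenBucket("default");

        // Close once at application shutdown.
        ClusterHelper.Close();
    }
}
```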

-Jeff

Thanks @jmorris

After a lot of time and effort, I think most of my difficulties arose from a collision between the office’s local network and the cluster’s network in Azure, connected through VPN. This resulted in nodes being unreachable.

I’m not sure that everything is OK now, but rest assured that I’ll let you know.

I hope that this post will help others.
Now I need a vacation :dizzy_face:
