ClientFailure when calling Get() concurrently

Hi,
I’m constantly getting “ClientFailure” when calling Bucket.Get(aKeys).
The verbose message says:
“Cannot access a disposed object.\r\nObject name: ‘System.Net.Sockets.NetworkStream’.”

I’m accessing the bucket from several threads in parallel.
I keep the Cluster and the Buckets open and static, without disposing them at any time (as instructed).

Wrapping the Get() call in a lock(this) doesn’t help.

Please advise.
Itay

@itay -

Somehow the connection is getting closed; this could be on the server side or the client side. If you can post logs, that would help.

Note that if you are using the overload of IBucket.Get which takes an IList (the bulk overloads), then you probably don’t want to be calling it within a parallel loop, because internally the client is already using the TPL; it’s simply too much parallelism. If you’re using regular threads (Thread.Start), then I would use one bucket per thread as opposed to sharing a bucket between threads.
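
Something along these lines (a minimal sketch; the bucket name, keys, and thread count below are placeholders, not recommendations):

```csharp
using System.Collections.Generic;
using System.Threading;
using Couchbase;
using Couchbase.Configuration.Client;

class OneBucketPerThread
{
    static void Main()
    {
        // One shared, long-lived Cluster instance for the process.
        var cluster = new Cluster(new ClientConfiguration());

        for (var i = 0; i < 4; i++) // thread count is arbitrary here
        {
            new Thread(() =>
            {
                // Each thread opens its own bucket reference instead of
                // sharing a single IBucket instance across threads.
                using (var bucket = cluster.OpenBucket("default"))
                {
                    // The bulk overload already parallelizes internally via
                    // the TPL, so don't also wrap it in Parallel.For/ForEach.
                    var keys = new List<string> { "key1", "key2" };
                    var results = bucket.Get<string>(keys);
                }
            }).Start();
        }
    }
}
```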

-Jeff

@jmorris

I am using batch Get. I am not using a parallel loop but Thread.Start.
Assuming I have 5 buckets and 10 threads, using a different bucket per thread actually means opening 50 buckets on start. At 5 seconds to open a bucket from localhost, that’s over 4 minutes just to start.

Anyway, that doesn’t make sense, as calling Get and Update should be thread-safe AFAIK. Besides, web apps have to handle many concurrent operations per second from different users (= different threads = many threads).

On top of that, things that worked great before are now returning many ClientFailure and OperationTimeout errors. It happens even with a single thread, a single bucket, and a single POCO type.

Update 1: A batch update with 10 docs usually fails for 3-6 of them. Some are saved, some are not, and the failed docs are different every time.

Update 2: When the batch is in the hundreds of docs, Bucket.Upsert(items) hangs. It worked perfectly before, both from local and from remote.
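
For reference, here’s a simplified sketch of the batch upsert and how I check which docs failed (the bucket name and doc contents are placeholders):

```csharp
using System;
using System.Collections.Generic;
using Couchbase;

class BatchUpsertCheck
{
    static void Main()
    {
        var cluster = new Cluster();
        using (var bucket = cluster.OpenBucket("default")) // placeholder bucket name
        {
            var items = new Dictionary<string, object>
            {
                { "doc::1", new { Name = "a" } },
                { "doc::2", new { Name = "b" } }
            };

            // The bulk Upsert returns a result per key; inspect each one
            // to see which docs failed and why.
            var results = bucket.Upsert(items);
            foreach (var kv in results)
            {
                if (!kv.Value.Success)
                {
                    Console.WriteLine("{0} failed: {1} ({2})",
                        kv.Key, kv.Value.Status, kv.Value.Message);
                }
            }
        }
    }
}
```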

I’m still investigating why.

After a night’s sleep, I looked at the log file and found the following entry for a failed doc:

2015-01-06 08:38:16,709 [16] DEBUG Couchbase.CouchbaseBucket - Operation retry 0 for key xxx. Reason: VBucketBelongsToAnotherServer

I checked the console to see if a rebalance was needed but found no indication.

@itay -

Have you added or removed a node? That is typically when you get NMVs.

-Jeff

I didn’t add or remove any nodes, but I did restart them (before the incident).
Should that break a cluster?

I eventually fixed it, I hope, by gracefully failing over a node, but this only got me to the next error:

Couchbase.IO.ConnectionPool`1 - No connections currently available on x.x.x.x:11210

I’m opening a new thread.

@jmorris,

It happened again :sleepy:

A specific doc is now inaccessible.

Digging through the logs, I found the same scenario: Operation retry 0 for key xxx. Reason: VBucketBelongsToAnotherServer.

I restarted the servers (2 servers, 3.0.1 Community, with 1 replica), but it didn’t help.
Deleting this specific doc and re-inserting it doesn’t help either.
BTW: the doc is easily accessible from the console.

I cannot remove a node every time the server mixes up vBuckets.
What can I do to recover (can I manually rebalance?), and what should I do to prevent it from happening again?

P.S. Reading from a replica doesn’t help either.

Update: Gracefully failing over one of the nodes and then doing a “full recovery” fixed it.

Itay

@Itay -

Yes, you have to rebalance after adding, removing or failing over a node. The rebalance will distribute the keys equally across the nodes and update the cluster map and vbucket mappings on the client so that the client can retrieve them.

There is a bug in 2.0.3 that affects this replica read issue; it will be fixed in 2.1.0.

Glad to hear you resolved it! :smile:

-Jeff

Jeff,

What can I do programmatically to fix it in real time, as it is a showstopper event?

It happened again! :rage:

The specific doc is accessible from the console.
Its JSON is valid.
Get() returns ClientFailure, with VBucketBelongsToAnotherServer in the log4net output.

I tried again to fail over a node and then do a “full recovery”. After an hour, it is still not working.

Now the log4net output also says:
Couchbase.Authentication.SASL.CramMd5Mechanism - Authenticating socket xxx
Couchbase.Authentication.SASL.CramMd5Mechanism - Authentication for socket xxx failed: Auth failure
Couchbase.IO.Strategies.DefaultIOStrategy - Could not authenticate aaa using Couchbase.Authentication.SASL.CramMd5Mechanism - xxx.

I don’t understand the authentication issue, as an adjacent doc can be read successfully.

I’m terrified about using Couchbase on a production system :anguished:

What am I doing wrong?

@itay -

I can’t be sure exactly what is going on here; probably the best course of action would be to create an NCBC (a ticket in the .NET client’s issue tracker) and provide a list of steps to reproduce and a sample project.

This doesn’t exactly make sense to me; an NMV is always a server error and never a ClientFailure. ClientFailures are errors where the client cannot receive a response from the server or where a serialization/deserialization error was raised; the error manifested in the client and is not a server response.

With NMVs, you should see them propagate to OperationTimedOut if the client cannot resolve the NMV for a Get. Prepend and Append may return an NMV, but this is changing in 2.1.0 (soon to be released).
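
You can tell these cases apart by inspecting the operation result’s status; a rough sketch (the bucket name and key are placeholders):

```csharp
using System;
using Couchbase;
using Couchbase.IO;

class CheckStatus
{
    static void Main()
    {
        var cluster = new Cluster();
        using (var bucket = cluster.OpenBucket("default")) // placeholder name
        {
            var result = bucket.Get<string>("somekey"); // placeholder key
            if (!result.Success)
            {
                // ResponseStatus distinguishes client-side failures from
                // server responses such as VBucketBelongsToAnotherServer.
                switch (result.Status)
                {
                    case ResponseStatus.ClientFailure:
                        Console.WriteLine("Client-side error: " + result.Message);
                        break;
                    case ResponseStatus.VBucketBelongsToAnotherServer:
                        Console.WriteLine("NMV - the client's cluster map is stale.");
                        break;
                    case ResponseStatus.OperationTimedOut:
                        Console.WriteLine("Timed out (possibly an unresolved NMV).");
                        break;
                }
            }
        }
    }
}
```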

The authentication failure can mean a few things (see the sketch after this list):
a) You provided the wrong credentials
b) The bucket does not exist on the server
c) The bucket was just created and has not been completely initialized on the cluster (this takes a few seconds, so if you are programmatically creating buckets, operations may fail in the short term unless you delay).
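
For (a), the bucket credentials come from the client configuration; a minimal sketch, assuming a SASL-protected bucket (the URI, bucket name, and password are placeholders):

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

class BucketAuth
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://127.0.0.1:8091/pools") },
            BucketConfigs = new Dictionary<string, BucketConfiguration>
            {
                {
                    "default", new BucketConfiguration
                    {
                        BucketName = "default",
                        Password = "secret" // must match the bucket's SASL password
                    }
                }
            }
        };

        var cluster = new Cluster(config);
        var bucket = cluster.OpenBucket("default");
    }
}
```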

In general, you don’t want to be swapping nodes in and out of a cluster willy-nilly, since it’s a fairly intensive process. It’s very easy to do in Couchbase, but it should be reserved as an operational task performed when load is not at its peak (if possible). I would set up my cluster and then just leave it alone, unless you must change it for some operational reason.

I am not sure; I create, tear down, rebalance, etc. every day while developing and load testing, and in general things work as expected. Let’s get the NCBC going and take it from there.

Don’t get discouraged :smile:

-Jeff

Hi @jmorris,

Thanks for your effort to help me solve this issue.

  1. What’s NMV? :confused:
  2. Usually ClientFailure does mean a deserialization error that I eventually find and fix. In this scenario, it is possible that the data was not returned, or was returned null or corrupted, and thus caused a deserialization error. I can also say that the doc’s JSON integrity is good and that the doc is accessible through the console.
  3. These authentication messages occur only for that specific doc, while the same connection easily returns other docs in the same bucket; hence the bucket, connection, credentials, etc. are working properly.
  4. Usually removing a node and then restoring it takes an hour of work from me and about the same downtime for the system. So, no, I don’t want to do that at all. Anyway, it doesn’t always help.
  5. Waiting anxiously for 2.1.0 :stuck_out_tongue:
  6. Also waiting for the next community version :stuck_out_tongue_winking_eye:
  7. I’m trying to hold in there, but it is very challenging :cold_sweat:
  8. I thought of another reason that might cause this issue: I’m using 2 web servers concurrently with the cluster, so perhaps there is a collision. Check this

@itay

1-NMV stands for “Not My VBucket”.
2-It’s possible; you should be able to trace this back in the logs. I highly doubt the server would be the cause of any null, empty or corrupted data…I haven’t heard of any such thing.
3-The connection to a bucket is authenticated, not the data going across the connection, so one doc couldn’t succeed while another fails. Also, the authentication occurs when the connection is created, not when it’s used.

8-You can open literally thousands of clients (var cluster = new Cluster()) to a Couchbase server cluster; however, the best practice is to use the bare minimum necessary. Running two web applications (separate processes, thus separate client instances) is fine. Note that each client instance (new Cluster()) and each bucket you open will create a pool of TCP connections. This is controlled by the ClientConfiguration.PoolConfiguration.MaxSize and MinSize properties.
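
For example (a sketch; the sizes shown are illustrative, not recommendations):

```csharp
using Couchbase;
using Couchbase.Configuration.Client;

class PoolSizing
{
    static void Main()
    {
        var config = new ClientConfiguration
        {
            // Each opened bucket gets its own TCP connection pool,
            // bounded by these values.
            PoolConfiguration = new PoolConfiguration
            {
                MinSize = 1,
                MaxSize = 10
            }
        };

        var cluster = new Cluster(config);
        var bucket = cluster.OpenBucket("default"); // placeholder bucket name
    }
}
```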

If you are using a single Cluster instance (ClusterHelper will ensure this), you shouldn’t run into any issues. Even if you are using two or more Cluster instances per process, it shouldn’t be a problem.
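
A minimal sketch of the ClusterHelper pattern (the bucket name is a placeholder):

```csharp
using Couchbase;
using Couchbase.Configuration.Client;

class SingleClusterInstance
{
    static void Main()
    {
        // Initialize once at application startup (e.g. Application_Start).
        ClusterHelper.Initialize(new ClientConfiguration());

        // ClusterHelper.Get() returns the same Cluster instance everywhere.
        var cluster = ClusterHelper.Get();
        var bucket = cluster.OpenBucket("default");

        // Close once at application shutdown.
        ClusterHelper.Close();
    }
}
```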

-Jeff

Thanks @jmorris

After a lot of time and effort, I think most of my difficulties arose from a collision between the office’s local network and the cluster’s network in Azure, connected through VPN. This resulted in nodes being unreachable.

I’m not sure that everything is OK now, but rest assured that I’ll let you know.

I hope that this post will help others.
Now I need a vacation :dizzy_face:
