How many CAS attempts should I try in a busy production server?


Let’s say I have a document that’s being updated by 1000 users/second.

How likely am I to get an error from a check-and-set (CAS) operation? And how many times should I retry the check-and-set after an error?

Is there a best practice for such a use case?

Not that I actually have such a document; I just want to find out how I should determine the number of attempts.

In short: “it depends”.

The longer answer:

Basically, you’re asking how likely it is that the document will have changed between the client reading the old value and the server processing the request to write the new value. That depends on a number of factors, some of which are:

  1. How long does it take for the previous document to be returned over the network from the cluster to the client?
  2. How long does the client take to construct the new document?
  3. How long does the request take to get over the network to the cluster node?
  4. How long does the request take to be read and processed by the cluster node?

These are all going to depend on network latency, client speed, cluster load, etc.
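The four steps above make up one read-modify-write cycle, and a CAS retry loop simply repeats that cycle until no other writer gets in between. Here is a minimal sketch in Python, assuming a hypothetical in-memory store (`InMemoryStore`, `CasMismatch` and `update_with_retry` are illustrative names, not part of any real SDK):

```python
import threading


class CasMismatch(Exception):
    """Raised when the supplied CAS token no longer matches the stored one."""


class InMemoryStore:
    """Hypothetical stand-in for a document store with CAS semantics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # key -> (value, cas)

    def get(self, key):
        # Returns (value, cas); a missing key reads as (None, 0).
        with self._lock:
            return self._docs.get(key, (None, 0))

    def replace(self, key, value, cas):
        # Write only if nobody else has written since we read `cas`.
        with self._lock:
            _, current = self._docs.get(key, (None, 0))
            if current != cas:
                raise CasMismatch(key)
            self._docs[key] = (value, current + 1)
            return current + 1


def update_with_retry(store, key, mutate, max_attempts=20):
    """Run the read-modify-write cycle until it succeeds.

    Returns the number of attempts used; raises RuntimeError if every
    attempt lost the race.
    """
    for attempt in range(1, max_attempts + 1):
        value, cas = store.get(key)              # steps 1: read old document
        new_value = mutate(value)                # step 2: build new document
        try:
            store.replace(key, new_value, cas)   # steps 3-4: send and apply
            return attempt
        except CasMismatch:
            continue                             # someone else won; re-read
    raise RuntimeError(f"gave up on {key!r} after {max_attempts} attempts")
```

The longer each cycle takes, the wider the window in which another client can change the document and force a retry.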

With the information you’ve given, it’s basically impossible to say. I suggest you measure those times on your own workload and environment, calculate the likelihood of a conflict from them, and pick a suitable retry count.

Alternatively you could test empirically and see how many retries you need in a representative test scenario.
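As a rough illustration of such an empirical test, the sketch below runs several threads against one shared document and records how many attempts each update needed. Everything here is an assumption for demonstration purposes: `VersionedDoc` is an in-memory stand-in for the real cluster, and `think_time` is a made-up figure standing in for network latency plus client work. Real numbers have to come from your own environment.

```python
import threading
import time
from collections import Counter


class VersionedDoc:
    """Hypothetical in-memory document guarded by a CAS counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = []
        self.cas = 0

    def get(self):
        with self._lock:
            return list(self.value), self.cas

    def replace(self, value, cas):
        # Store `value` only if `cas` is still current; True on success.
        with self._lock:
            if cas != self.cas:
                return False
            self.value = value
            self.cas += 1
            return True


def run_experiment(workers=8, updates_per_worker=25,
                   think_time=0.001, max_attempts=1000):
    """Return (doc, histogram) where histogram maps attempts -> count."""
    doc = VersionedDoc()
    histogram = Counter()
    hist_lock = threading.Lock()

    def worker(worker_id):
        for n in range(updates_per_worker):
            for attempt in range(1, max_attempts + 1):
                value, cas = doc.get()
                time.sleep(think_time)  # simulated latency + client work
                value.append((worker_id, n))
                if doc.replace(value, cas):
                    with hist_lock:
                        histogram[attempt] += 1
                    break
            else:
                # Every attempt lost the race.
                with hist_lock:
                    histogram["gave_up"] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return doc, histogram


if __name__ == "__main__":
    _, hist = run_experiment()
    for attempts, count in sorted(hist.items(), key=lambda kv: str(kv[0])):
        print(f"{attempts} attempt(s): {count} update(s)")
```

The tail of the resulting histogram is what matters: a retry limit that covers, say, the slowest 1 in 1000 updates in a representative run is a more defensible choice than a number picked up front.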

Thank you @drigby.

Let’s add some context. Say we’re developing a comment system like Disqus, and we need to support a website with 4000 active users where a popular post receives 100 comments/second.

I’m thinking about the following document structures.

  • post.1
  • post.1.comments (this is where we store the ids of the comments)
  • post.1.comments.1 (comment 1)
  • post.1.comments.2 (comment 2)

In this particular use case, assuming we get 100 comments per second, I’m thinking about setting the maximum number of attempts to 20.

I understand that it depends on the factors you’ve mentioned, but is this even a reasonable approach?

So the additional piece of data I think you need is the average time taken to process an update.

Assuming your numbers, 100 comments per second on a popular post means there’s a new comment every 10ms on average. You would therefore need to GET the “current” value of the document, insert the new comment and SET the new value back in generally <10ms, otherwise you’ll be constantly fighting with other clients.
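To put rough numbers on that, here is a back-of-the-envelope sketch. It rests on simplifying assumptions that are not from this thread: competing writes arrive independently at a steady average rate (a Poisson process), every attempt has the same window, and retries don’t add extra load. Real traffic is burstier, so treat the results as a sanity check rather than a prediction.

```python
import math


def conflict_probability(writes_per_second, window_seconds):
    """P(at least one competing write lands inside one GET..SET window)."""
    return 1.0 - math.exp(-writes_per_second * window_seconds)


def expected_attempts(p_conflict):
    """Mean number of attempts until the first success (geometric)."""
    return 1.0 / (1.0 - p_conflict)


def exhaustion_probability(p_conflict, max_attempts):
    """P(all max_attempts fail), treating attempts as independent."""
    return p_conflict ** max_attempts


if __name__ == "__main__":
    rate = 100  # comments per second, from the scenario above
    for window_ms in (1, 5, 10, 50):
        p = conflict_probability(rate, window_ms / 1000)
        print(f"window {window_ms:>2} ms: "
              f"p(conflict)={p:.3f}, "
              f"mean attempts={expected_attempts(p):.1f}, "
              f"p(20 attempts all fail)={exhaustion_probability(p, 20):.2e}")
```

Under these assumptions a 10ms window at 100 writes/second gives a conflict probability of roughly 0.63 per attempt, so an update needs about 2.7 attempts on average, and 20 attempts would all fail on the order of once in 10,000 updates. Shrinking the window helps far more than raising the retry limit.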