Improving document retrieval speed

We currently have 2 Couchbase nodes with 16GB of RAM and 4 cores each. We are planning to switch to 3 Couchbase nodes, each with 32GB of RAM. We ran some benchmark tests and found that the performance of the get operations (with multiple IDs) was the same regardless of how much RAM the Couchbase server and/or buckets had. This leads us to think that the RAM we have is probably enough to cache our most frequently used keys, so adding more RAM doesn't improve performance.

All our operations are executed by chunking the keys to retrieve into batches of documents, currently 2,500 per batch, because with larger batches the script running on the application server crashes with a Client Side Timeout error. How can we avoid this error and speed up the retrieval?
Is there a limit on the maximum number of documents a (multi) get operation can retrieve? Or is it related to the size of the cluster Couchbase is running on?

The documents used in the test are about 2 KB in size, with keys of about 15 bytes.

Hi @sovente,

Some questions to get a better picture:

What is the version of Couchbase Server?
Which SDK are you using for the benchmark tests? (Can you share the benchmark code?)

There may be a limit in the underlying SDK's architecture on the number of requests that can be batched. A client side timeout can have a number of causes - network issues, a slow server response, a client issue… We can understand the issue better once we know the answers to the questions above.
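If the timeout simply proves too short for large batches, it can also be raised on the client side. As a minimal sketch, assuming the PHP SDK 2.x that comes up later in this thread and its bucket-level operationTimeout setting (expressed in microseconds; the host, bucket name and value below are only placeholders):

// Open the bucket as usual.
$cluster = new \CouchbaseCluster("couchbase://127.0.0.1");
$bucket = $cluster->openBucket("default");

// Raise the per-operation client-side timeout to 10 seconds;
// the property is expressed in microseconds.
$bucket->operationTimeout = 10000000;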

Cheers,
Subhashni

Are you interested in maximising throughput, or minimising latency?

If it’s the former, then you can essentially batch as large as you like, but bear in mind that you’ll get diminishing returns after a while (depending on document size and environment).

If it’s the latter, then you basically don’t want to batch at all (i.e. request individual documents), and do it across a number of independent connections - that way one document taking a bit longer to come back doesn’t affect the other requests.
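For illustration, a minimal sketch of the two approaches, assuming the Couchbase PHP SDK 2.x bucket API that comes up later in this thread ($ids and $batchSize are placeholder names):

// Throughput-oriented: multi-get a whole batch of keys in one call,
// letting the SDK pipeline the requests to the cluster.
foreach (array_chunk($ids, $batchSize) as $batch) {
  $docs = $bucket->get($batch); // array of results keyed by document ID
  // ... process $docs here ...
}

// Latency-oriented: fetch documents one at a time (ideally from several
// independent workers/connections), so that one slow document does not
// hold up the others.
foreach ($ids as $id) {
  $doc = $bucket->get($id);
  // ... process $doc->value here ...
}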

Thank you @subhashni, we are using Couchbase 4.1 CE and PHP SDK 2.2.0 (PHP 7.0.9-1) with the following code, which uses a PHP generator inside a foreach:

$chunkedIDs = array_chunk($IDs, $multiGetSize);
foreach ($chunkedIDs as $chunk) {
  try {
    // Multi-get: fetch the whole chunk of keys in a single call.
    $response = $bucket->get($chunk);
  } catch (\Exception $e) {
    throw new StorageException("Error: " . $e->getMessage());
  }
  // Yield each retrieved document back to the caller.
  foreach ($response as $key => $value) {
    yield $key => $value;
  }
}

We tried to run more concurrent scripts, but the overall throughput does not change between 2 and 3 nodes. So, what is the right way to maximize throughput? What do the number of nodes and the available RAM affect?

Hi @drigby, we are interested in maximising throughput. How can we determine the optimal batch size? Is it influenced by the total RAM or the number of nodes in the cluster?

Is a document size of about 2 KB reasonable? Or is it better to have smaller/larger documents to optimize performance?

Thank you!

It’ll depend on many factors - in addition to what you mentioned, it’ll also depend on the network latency between your clients and the cluster.

I suggest you benchmark your application / environment with a variety of batch sizes and see what works best for you.
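Something along these lines would do as a rough starting point - a sketch reusing the $bucket and $IDs from your snippet above, with arbitrary candidate sizes:

// Time a full multi-get pass over the same key set at several batch sizes.
$candidateSizes = array(500, 1000, 2500, 5000);
foreach ($candidateSizes as $size) {
  $start = microtime(true);
  try {
    foreach (array_chunk($IDs, $size) as $chunk) {
      $bucket->get($chunk);
    }
  } catch (\Exception $e) {
    printf("batch size %d: failed (%s)\n", $size, $e->getMessage());
    continue;
  }
  $elapsed = microtime(true) - $start;
  printf("batch size %d: %.3f s (%.0f docs/s)\n", $size, $elapsed, count($IDs) / $elapsed);
}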

Thank you @drigby, we’ll benchmark our infrastructure every time we need to change it.

Thanks @sovente. There is a Couchbase sizing guide at http://developer.couchbase.com/documentation/server/current/install/sizing-general.html. If more of your dataset fits in memory, throughput will be better. @avsej here can help you with best practices for PHP.

Great guide, it was exactly what we were missing. We are now able to benchmark our infrastructure to maximize throughput.

Thank you very much for the help!