How to perform a REAL multi-get

I tried to use the multi-get method from the 2.0 SDK:

public IDictionary<string, IOperationResult<T>> Get<T>(IList<string> keys)

but found out that the performance of this method is really poor. I was astonished to discover, by looking at the source of the library, that this method is a total cheat: it simply performs individual gets, spread across multiple threads:

    public IDictionary<string, IOperationResult<T>> Get<T>(IList<string> keys)
    {
        var results = new ConcurrentDictionary<string, IOperationResult<T>>();
        var partitionar = Partitioner.Create(0, keys.Count());
        ParallelOptions op = new ParallelOptions() { MaxDegreeOfParallelism = 1 };
        Parallel.ForEach(partitionar,op, (range, loopstate) =>
        {
            for (var i = range.Item1; i < range.Item2; i++)
            {
                var key = keys[i];
                var result = Get<T>(key);
                results.TryAdd(key, result);
            }
        });
        return results;
    }

How do I perform a real multi-get in .NET - I mean, get 1000 keys in JUST ONE REQUEST?

Regards
Piotr

I’m sure @jmorris will be able to answer more fully with respect to the .NET SDK, but I did want to mention that at a protocol level, a multi-get is actually a pipelined series of gets with asynchronous IO. That’s the best way to use the binary protocol when fetching many operations.

@drak25 -

Per @ingenthr's remarks, this is the best way to parallelize gets: the vBucket lookup, serialization, etc. are parallelized, as well as the transport layer. Internally, the client uses a connection pool to distribute the operations across connections; you can use ConnectionPool.MaxSize to tune the size of that pool.
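
For example, a rough sketch of setting the pool size programmatically (namespaces Couchbase and Couchbase.Configuration.Client; the URI, bucket name, and pool sizes below are placeholders to tune for your environment):

    var config = new ClientConfiguration
    {
        Servers = new List<Uri> { new Uri("http://localhost:8091/pools") },
        BucketConfigs = new Dictionary<string, BucketConfiguration>
        {
            {
                "default", new BucketConfiguration
                {
                    BucketName = "default",
                    // A larger pool lets bulk operations spread across more connections.
                    PoolConfiguration = new PoolConfiguration
                    {
                        MinSize = 5,
                        MaxSize = 10
                    }
                }
            }
        }
    };

    var cluster = new Cluster(config);
    var bucket = cluster.OpenBucket("default");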

As for performance, I have seen upwards of 40k ops/sec when using this method with the appropriate overload. To get the best performance, you’ll need to tune ParallelOptions.MaxDegreeOfParallelism for your hardware and choose an optimal partition size.
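
Something along these lines, for example, where bucket is an opened bucket and keys is your IList<string> (the rangeSize of 100 is only a starting point to tune from; ParallelOptions lives in System.Threading.Tasks):

    var options = new ParallelOptions
    {
        // Keep this at or below the number of cores on the app server.
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };

    // Overload taking ParallelOptions and a partition (range) size.
    var results = bucket.Get<string>(keys, options, 100);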

On another note, the code you posted here is not correct: MaxDegreeOfParallelism is hardcoded to 1, which effectively makes the client use one thread pool thread for every request; that would likely be slower than just using the main thread because of the thread pool overhead. Checking the source, I don’t see this code on the master branch, so perhaps you are looking at old code? If so, that may be why you are seeing “poor” performance.

-Jeff

I tested the old .NET SDK (1.3) and verified that its multi-get method used a single TCP/IP packet to send multiple keys to the Couchbase server. Why is this no longer done in the 2.0 library?

That’s interesting. Note that how the requests are divided among packets and how the requests are sent are two different things. That single packet could well have had multiple memcached opcode requests in it. While the SDK doesn’t directly control how TCP is packetized, maybe some of the APIs we used are different here, or we have different TCP_NODELAY settings. Sounds like something to be investigated.

What can you tell us about the network setup for both the 1.3 and 2.0 client to the server? Physical network or localhost?

Being broken up into more packets is sometimes desirable and sometimes not. Generally there’s slightly more CPU time for processing more small packets, but you get lower latencies for it (as long as you have the resources). There are times people want higher possible throughput through buffering more and times that people want the lowest latency per operation possible. It’s one of those places where latency and throughput are in tension.
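
Just to illustrate the setting itself - this is plain BCL socket code, not SDK configuration (srvex is the host from your config, and 11210 is the Couchbase data port):

    using System.Net.Sockets;

    var tcp = new TcpClient("srvex", 11210);
    // Disable Nagle's algorithm: small requests go out immediately instead of
    // being buffered into larger packets (lower latency, more packets on the wire).
    tcp.NoDelay = true;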

I tested this myself with an experiment with a different SDK (the C SDK), and found no significant difference in my test rig - two Linux systems talking to each other - between using TCP_NODELAY and not. I definitely had a higher packet count, but the max throughput was about the same and the difference in resource usage was not really measurable. I didn’t study latencies.

Of course, if you’re seeing a big difference between 1.3 and 2.0 on throughput or latency at the app level, that’s a different thing entirely. Other than the packet difference, what are you seeing?

I actually performed a performance comparison of versions 1.3.9 and 2.0.0.1 (both from NuGet).

    var cluster = new Cluster("couchbasev2/couchbase");
    var bucket201 = cluster.OpenBucket("fitting");
    var client13 = new CouchbaseClient("couchbase", "fitting", "");

    List<string> keys = new List<string>();

    // Seed 1000 keys through the 1.3 client.
    for (int ii = 0; ii < 1000; ii++)
    {
        string k = "tk" + ii;
        client13.Store(StoreMode.Set, k, "1");
        keys.Add(k);
    }

    while (true)
    {
        Stopwatch swx = new Stopwatch();

        // Multi-get via the 1.3.9 client.
        swx.Start();
        var mg1 = client13.Get(keys);
        swx.Stop();
        Console.WriteLine("Lib 1.3.9: " + swx.ElapsedMilliseconds + "ms");

        // Multi-get via the 2.0.0.1 client.
        swx.Reset();
        swx.Start();
        var mg2 = bucket201.Get<string>(keys);
        swx.Stop();
        Console.WriteLine("Lib 2.0.0.1: " + swx.ElapsedMilliseconds + "ms");

        Console.ReadKey();
    }

Average results are:
Lib 1.3.9: 57ms
Lib 2.0.0.1: 202ms

This does not change if I provide ParallelOptions with MaxDegreeOfParallelism = 10 for the 2.0.0.1 Get.

This is a real example demonstrating that version 2.0.0.1’s multi-get is nowhere near version 1.3.9’s multi-get performance.

V2 pool configuration used:

    <couchbasev2>
      <couchbase>
        <servers>
          <add uri="http://srvex:8091/pools"></add>
        </servers>
        <buckets>
          <add name="fitting">
            <connectionPool name="custom" maxSize="100" minSize="15"></connectionPool>
          </add>
        </buckets>
      </couchbase>
    </couchbasev2>

@drak25

I took the code you posted and split it into three different console applications; the only difference is that each application replaces that loop (the while(true)) with a loop of 10 iterations through the Get, and each one is dedicated to a specific SDK version: 1.3.11, 2.0.0.1, and 2.0.1 (which will soon be released). I ran each against the same Couchbase instance (3.0.2-26 Enterprise Edition (build-26)) running on localhost. I used the same bucket: the default bucket.

Here is the code and the results for 1.3.11:

    class Program
    {
        static void Main(string[] args)
        {
            var client13 = new CouchbaseClient("couchbase", "default", "");

            List<string> keys = new List<string>();

            for (int ii = 0; ii < 1000; ii++)
            {
                string k = "tk" + ii;
                client13.Store(StoreMode.Set, k, "1");
                keys.Add(k);
            }

            var count = 0;
            while (count++ < 10)
            {
                Stopwatch swx = new Stopwatch();
                swx.Start();
                var mg1 = client13.Get(keys);
                swx.Stop();
                Console.WriteLine("Lib 1.3.11: " + swx.ElapsedMilliseconds + "ms");
                swx.Reset();
            }
            Console.ReadKey();
        }
    }

Here is the screenshot:

Here is the code for 2.0.0.1 (the 2.0 SDK on NuGet):

    class Program
    {
        static void Main(string[] args)
        {
            var cluster = new Cluster("couchbaseClients/couchbase");
            var bucket201 = cluster.OpenBucket();

            List<string> keys = new List<string>();

            for (int ii = 0; ii < 1000; ii++)
            {
                string k = "tk" + ii;
                bucket201.Upsert(k, "1");
                keys.Add(k);
            }

            var count = 0;
            while (count++ < 10)
            {
                Stopwatch swx = new Stopwatch();
                swx.Start();
                var mg2 = bucket201.Get<string>(keys);
                swx.Stop();
                Console.WriteLine("Lib 2.0.0.1: " + swx.ElapsedMilliseconds + "ms");
            }
            Console.Read();
        }
    }

And here is a screenshot of the results:

I also did the same on a build of the next version of the SDK to be released (soon), 2.0.1 - the code is the same, so I won’t post, but here is the screenshot:

As you can see, between 1.3.11 and 2.0.0.1 the performance is much closer; 2.0.0.1 has much more variance, though, with a max of 108ms and a min of 26ms for an avg of 44.8ms. 1.3.11 has a max of 68ms and a min of 39ms for an avg of 43.5ms. That is pretty much even for performance; note that the first loop takes the longest for both. Now, for 2.0.1, things look much better: the max is 38ms and the min is 21ms for an avg of 27.7ms - a big improvement over both. The improvement can be attributed to a re-worked IO/transport layer and refactoring; the client should be even faster in subsequent releases (remember it’s new code, so it will take a bit to optimize).

The reason the implementation is the way it is (and not the same as 1.3.X) is that when it was tested, it was pretty close performance-wise to 1.3.X, but much less complex. That being said, the internal implementation may change in future releases, but only if the change is proven to be substantially faster than the current implementation and more stable.

Thanks!

-Jeff

The difference in our test procedures is that I used a Couchbase server reached over the network; it sits on a local network. I suppose there is an additional performance hit from the network layer, and that hit becomes substantial when the get is performed 1000 times instead of once.

Thank you for your effort. Will try the new version as soon as it is released.

I thought that maybe the lower performance in my case was caused by my local network, which is somewhat dated and may be slow. So I created a more reliable testing environment in my production network (a virtual local network at oktawave.com). This network should be fast enough.

I created a two-server cluster of the newest available Couchbase Enterprise and modified my code to perform the get 10 times.

This is the result:

The performance of version 1.3.9 is really close to what you are showing for localhost.
Version 2.0.0.1 still lags behind.

Regards
Piotr

Hi all,
I was starting to write a similar question when I found this thread, and I want to share my experience comparing a multi-get with many single gets.
The server is remote; C#, .NET SDK 2.0.0.1.
Running 10 single GETs takes ~10*83 ms (the exact numbers are less significant here).
Running 1 multi-GET takes ~830 ms.
That is the same time per item, which makes me think that perhaps my configuration is not set to run the gets in parallel.

Where can I control the multi-get’s MaxDegreeOfParallelism and/or the bucket’s connectionPool? (I’m opening a bucket using Cluster.OpenBucket() and not from web.config.)

Furthermore, to add to the above discussion: getting 100 items in one single call is by every measure better than invoking 100 separate calls, whether in terms of threads, CPU, latency, etc.

Regarding MaxDegreeOfParallelism - there is a method overload you can use to provide a ParallelOptions object.

@drak25 Thanks, I missed that
I set it to 10 in parallel, yet the results remained the same :weary:

@drak25 and @itay -

Good points, and probably correct. I created a Jira ticket to investigate deeper: https://issues.couchbase.com/browse/NCBC-781

As for MaxDegreeOfParallelism, you want it at or below the number of virtual cores on your application server. More will actually make it slower.

Thanks!

-Jeff

I set MaxDegreeOfParallelism to 1/2/4 on a quad-core machine and got the same results. Setting it to 10 performed worse.

Furthermore, as per this post, a batch get should use the Task-based Asynchronous Pattern (TAP) rather than TPL parallelism, because the bottleneck is network latency and server processing delay, not the client’s CPU - just as is done for Views.
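
A minimal sketch of that TAP-style bulk get, assuming the SDK version in use exposes an IBucket.GetAsync<T>(string key) method (adjust if yours does not):

    // Fan out the gets and await them together instead of burning threads.
    public static async Task<IDictionary<string, IOperationResult<string>>> BulkGetAsync(
        IBucket bucket, IEnumerable<string> keys)
    {
        var tasks = keys.ToDictionary(k => k, k => bucket.GetAsync<string>(k));
        await Task.WhenAll(tasks.Values);
        return tasks.ToDictionary(kv => kv.Key, kv => kv.Value.Result);
    }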

Saying that, and I quote myself, a real batch request should be even better:

Getting 100 items in one single call is by every mean better than invoking 100 separate calls, either wrt threads, CPU, latency, etc.

Hi,
We are experiencing the same issue with the 2.0.1 client. In our tests we run a bulk get operation of 1000 items, repeated 100 times. For the 1.3.10 client the whole operation took about 10 sec; for the 2.0.1 client it took more than 50 sec. For the tests, I used a remote cluster consisting of 3 nodes (3.0.1 Community Edition). I played around with MaxDegreeOfParallelism to no avail.

We are using Couchbase as a distributed cache in a performance-critical role. Unfortunately the 2.0.1 bulk get performance is a deal breaker for us, so we can’t make the switch until it is sorted out (although we really like the new features 2.0.1 offers, especially the replica reads).
Can you guys please vote for the Jira issue mentioned by jmorris? Maybe it will get higher priority.
Thanks,
Bence

@krumplib430 Thank you for your input. I have just voted for the Jira issue.

Follow up here

Hi jmorris,

Do you have any planned release date for 2.1? We would really like to use the new client library because of the new features, but we can’t make the move until the BulkGet performance is sorted out.

Thank you,
Bence Farkas

@krumplib430

2.1.0 is planned for the first week of May: https://issues.couchbase.com/browse/NCBC/fixforversion/12504

-Jeff

Hey Jeff,

This issue is being moved from version to version. Is it possible to know when it is going to be implemented?

Regards
Piotr