Trying to insert data into Couchbase as suggested in Bulk Operations but always facing timeouts

I am trying to insert multiple documents into CB without success. I assume it shouldn’t be a problem for such a DB engine to insert 100K documents (using the suggestions in Bulk Operations | Couchbase Docs). But I am getting timeouts in my local environment with 12 cores and an SSD. After some time CB gives timeouts. The .NET SDK’s retry logic runs, but it hits another timeout, and in the end a .NET exception is thrown.

What is the reason for those timeouts? Why can the server not handle this concurrency? Or is it the client SDK that causes this problem (it is CouchbaseNetClient 3.2.9)?

This is not a big workload, so increasing the timeout shouldn’t be the solution.

For example, when I try SQL Server with such a workload I don’t get timeouts. So how can I be sure that in a prod environment we will not face timeouts?

bucket:
Durability level: Majority and persist to active
No replicas

document:

{
	"type": "productX",
	"productSourceId": 642930,
	"productCode": "XXXXX",
	"productName": "W/CXX/8/XXXXX/XX",
	"class": "Cosmetics",
	"assetGroupId": 110291,
	"barcode": "88888888888888",
	"sizeId": 248,
	"size": "STD",
	"colorId": 0,
	"color": "BR77",
	"colorFamily": "yyyyyyyy",
	"colorName": "FFFFF",
	"imageUrl": null,
	"webName": "",
	"createdDate": "2020-12-09 20:12:33.8272886 +03:00",
	"subDivisionId": 250,
	"subDivision": "wwwwww",
	"seasonId": 4244,
	"season": "SS",
	"productId": 642930
}

.NET code:

async Task InsertParallelAsync()
{
    Console.WriteLine("---InsertParallelAsync");
    var cluster = await Cluster.ConnectAsync($"couchbase://localhost", "Administrator", "111111");
    var collection = (await cluster.BucketAsync("Test")).DefaultCollection();
    var itemList = System.Text.Json.JsonSerializer.Deserialize<List<ProductItem>>(System.IO.File.ReadAllText("product_100k.json"));
    var taskList = new List<Task<IMutationResult>>();

    var sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    // start an InsertAsync task for every document at once and collect the tasks
    foreach(var item in itemList)
    {
        var task = collection.InsertAsync(Guid.NewGuid().ToString(), item);
        taskList.Add(task);
    }
    await Task.WhenAll(taskList);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

exception:
An exception of type ‘Couchbase.Core.Exceptions.AmbiguousTimeoutException’ occurred in System.Private.CoreLib.dll but was not handled in user code: ‘The operation /556 timed out after 00:00:05. It was retried 1 times using Couchbase.Core.Retry.BestEffortRetryStrategy.’

Additionally,
I tried Couchbase.Extensions.MultiOp, which is the extension library meant to overcome this issue. But the performance is not acceptable: inserting the documents sequentially, without any parallelism, gives approximately the same performance.

Inserting 100K documents using Couchbase.Extensions.MultiOp:

async Task InsertParallelWithOptimizationAsync()
{
    Console.WriteLine("---InsertParallelWithOptimizationAsync");
    var cluster = await Cluster.ConnectAsync($"couchbase://localhost", "Administrator", "111111");
    var collection = (await cluster.BucketAsync("Test")).DefaultCollection();
    var itemList = System.Text.Json.JsonSerializer.Deserialize<List<ProductItem>>(System.IO.File.ReadAllText("product_100k.json"))
    .Select(e =>
    { 
        e.id = Guid.NewGuid().ToString();
        return e;
    }).ToDictionary(e => e.id);
    var sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    var result = await collection.Insert(itemList).ToList();
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

Takes 1091672 ms ~18 minutes


Inserting 100K documents without any parallelism:

async Task InsertSequentialAsync()
{
    Console.WriteLine("---InsertSequentialAsync");
    var cluster = await Cluster.ConnectAsync($"couchbase://localhost", "Administrator", "111111");
    var collection = (await cluster.BucketAsync("Test")).DefaultCollection();
    var itemList = System.Text.Json.JsonSerializer.Deserialize<List<ProductItem>>(System.IO.File.ReadAllText("product_100k.json"));
    var sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    foreach(var item in itemList)
    {
        await collection.InsertAsync(Guid.NewGuid().ToString(), item);
    }
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

Takes 1209179 ms ~20 minutes

So, I couldn’t find a performant way to insert 100K documents into CB from my single-machine client. There should be a way to benefit from concurrency, since this is a server system.

Hello,

  1. Remember that there is a big difference between using, say, the Task Parallel Library, which is designed to help you compute things across multiple cores and threads, and doing heavy I/O work. These are not the same pattern. You are doing a high volume of I/O (network I/O, to be exact), and adding more threads and CPU power doesn’t always mean better performance for network I/O.
  2. I have created a very simple example, but the concept might be able to help you. I created a load test using NBomber on GitHub to show some examples of loading a lot of records:
    GitHub - biozal/cb-dotnet-load-test: Loading Testing Couchbase .NET SDK using NBomber

This screenshot is from my computer and if you look you can see I was able to get around 1,450 ops/sec:

You’ll note I’m playing with a lot of performance settings, and when you start doing high-performance work you will need to look into some of these. For example, the number of key-value service connections:
options.NumKvConnections = 4;
options.MaxKvConnections = 8;

https://docs.couchbase.com/sdk-api/couchbase-net-client/api/Couchbase.ClusterOptions.html#Couchbase_ClusterOptions_NumKvConnections

Also, I tweaked the MaximumRetainedOperationBuilders:
https://docs.couchbase.com/sdk-api/couchbase-net-client/api/Couchbase.TuningOptions.html
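
Roughly, the wiring looks something like this (a sketch only; the values are placeholders, and the exact shape of the Tuning property may differ slightly between 3.x releases, so check the docs linked above):

var options = new ClusterOptions().WithCredentials("Administrator", "111111");

// key-value connections opened per node, and the ceiling the pool can scale up to
options.NumKvConnections = 4;
options.MaxKvConnections = 8;

// operation-builder pooling lives on TuningOptions (docs linked above);
// the value here is only a guess, to show where the knob sits
options.Tuning.MaximumRetainedOperationBuilders = Environment.ProcessorCount * 4;

var cluster = await Cluster.ConnectAsync("couchbase://localhost", options);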

Those numbers I just guessed at, and you would want an architect to provide numbers based on the number of nodes, type of nodes, and setup of your cluster. The test is also throttled at around 900 inserts a second to simulate load on the run I took the screenshot from. You can play with the numbers to increase or decrease the load by editing the function parameters in the WithLoadSimulations call.

I’ve checked in small load numbers to get started with, because I’m running this on a Macbook Pro 13" with an Intel i5 processor that’s a few years old by today’s standards. It barely uses any memory, and this is on a single-node cluster running in Docker. Obviously, with more nodes in the cluster, a faster processor, and a machine that isn’t running a bunch of other things (I’m literally debugging an Android app on it at the moment and have about 20 other programs running), I could get much better performance and probably increase the key-value connection count, which would lower the time it takes.

I would note that making these numbers too high will result in a timeout - but it’s not a networking timeout, it’s running out of threads in the thread pool, and the timeout comes from waiting for new threads. This is because DataFlowConnectionPool uses the TPL, which uses the ThreadPool. Other things using TPL or ThreadPool threads (such as Parallel.ForEach) running at their maximum parallelism can exhaust the ThreadPool, which NBomber also uses for simulating the load. This was figured out by running:

dotnet-counters.exe monitor --name cb-dotnet-load-test --counters System.Runtime[threadpool-queue-length,threadpool-thread-count,monitor-lock-contention-count],CouchbaseNetClient --maxTimeSeries 10000 --maxHistograms 1000

I was able to achieve 100,000 records in under a minute, and this was just me playing on my laptop with a single-node cluster running in Docker. It is very possible to get good numbers - it just requires tweaking some code and a bit more understanding of the API, which I admit our documentation doesn’t always cover for some of the more advanced options.

Hope that helps in some way.

Thanks
Aaron

@sedat-eyuboglu

I just want to add a couple more points to @biozal’s excellent response above.

  • The operation timeout in the SDK currently starts from the time you initiate the operation, not from the time the operation is sent over the wire. This means CPU cost for serialization, thread pool scaling, and the network send queue are all factors in the timeout. So if you initiate 100k operations simultaneously, many of them are likely to time out because they won’t all complete within the operation timeout.
  • The recommended pattern for bulk operations is to limit your degree of parallelization (the exact number is tricky and depends on many factors: CPU, network speed, network latency, the nature of the data you’re posting, all of the configuration settings mentioned in the previous answer, and whether your process is doing other work in parallel besides the bulk operation)
  • If you are on .NET 6, Parallel.ForEachAsync with a customized degree of parallelization is an easy way to implement this control (a rough sketch follows below)
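
For illustration, a minimal sketch of that pattern, reusing the collection and item list from your code (the degree of parallelism here is only a placeholder to tune, not a recommendation):

await Parallel.ForEachAsync(itemList,
    new ParallelOptions { MaxDegreeOfParallelism = 16 }, // tune this for your environment
    async (item, cancellationToken) =>
    {
        // only this many inserts are in flight at any one time,
        // so each one has a realistic chance to finish within the operation timeout
        await collection.InsertAsync(Guid.NewGuid().ToString(), item);
    });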

I hope this helps.


@biozal Thank you for your detailed info. I agree with you that it’s not just CPU and memory; TCP pools, threads, and many other things can cause such a result.

I tried to run your sample.

  • Clone the repo
  • Open it with Visual Studio
  • Disable <ImplicitUsings>enable</ImplicitUsings>, since with this option enabled the project cannot be built
    And I get the following exception

    I could not spend much time debugging it.

On the other hand, I think I found out why you loaded the data in a much more performant way.
In my new test code (below), I load the 100K documents in 134 seconds, which is much better than my previous attempts.

async Task InsertParallelWithForEachAsync()
{
    Console.WriteLine("---InsertParallelWithForEachAsync");
    var cluster = await Cluster.ConnectAsync($"couchbase://localhost", options=>
    {
        options.WithCredentials("Administrator", "111111");
        options.NumKvConnections = 32;
        options.MaxKvConnections = 64;
        options.KvSendQueueCapacity = 8096;
    });
    
    var collection = (await cluster.BucketAsync("Test")).DefaultCollection();
    var itemList = System.Text.Json.JsonSerializer.Deserialize<List<ProductItem>>(System.IO.File.ReadAllText("product_100k.json"));

    var sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    await Parallel.ForEachAsync<ProductItem>(itemList, new ParallelOptions()
    {
        MaxDegreeOfParallelism = 50
    }, 
    async (item, token) =>
    {
        await collection.InsertAsync(Guid.NewGuid().ToString(), item);
    });
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

As you can see, I set NumKvConnections, MaxKvConnections, and KvSendQueueCapacity, and I also use Parallel.ForEachAsync with the MaxDegreeOfParallelism option as @btburnett3 suggested.

I think you missed something: in my first try my bucket was using the durability level Majority and persist to active, but in your test the write operations are not durable. So for my test code I changed my durability level to None, which is the default for a bucket.

With non-durable writes I could achieve 134 seconds for 100K documents. I am sure that with some more tuning I could get more performance.

But we need durable writes. Since the data is critical, we have to be sure it is persisted to disk.

Without any change to the code, I just changed the durability level of the bucket to Majority and persist to active,
and the performance decreased dramatically. It is normal to get worse performance with a disk write,
but we don’t expect to see ~18 min vs ~134 sec.

I thought I would be able to set the durability level on the application side for just this operation. But as far as I know, the durability level set on the server is the base (minimum) level, and we cannot set a lower durability than the server’s.
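
If I understand the SDK correctly, per-operation durability looks roughly like this, but it can only match or raise the bucket’s minimum durability level, not lower it (so with the bucket set to Majority and persist to active, requesting None per operation would not help):

await collection.InsertAsync(Guid.NewGuid().ToString(), item, options =>
{
    // request durability for this single operation; the bucket's minimum
    // durability still acts as the floor
    options.Durability(DurabilityLevel.MajorityAndPersistToActive);
    options.Timeout(TimeSpan.FromSeconds(30));
});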

Please note that I also commented out NumKvConnections, MaxKvConnections, and KvSendQueueCapacity, allowing them to use their default values, and could get better performance, as below.

var cluster = await Cluster.ConnectAsync($"couchbase://localhost", options=>
    {
        options.WithCredentials("Administrator", "111111");
        // options.NumKvConnections = 32;
        // options.MaxKvConnections = 64;
        // options.KvSendQueueCapacity = 8096;
    });

And I could load the 100K documents in 80 seconds, but again with non-durable writes. As for durable writes, I couldn’t get good performance.

(Sorry, the forum doesn’t allow me to upload a second media item; see the reply for the screenshots.)

So just switching to durable writes makes this sample ~17 times slower (ops/sec) than the non-durable one. Do you know how to avoid such a difference?

With NONE durable write

With durable write

@sedat-eyuboglu I missed that part in your post. 17 times slower doesn’t seem right. When I worked on a demo last fall I didn’t see nearly that bad a performance hit when changing the durability level - although I was running a small 3-node cluster. For this test - how many nodes are in your cluster and what services are they running?

-Aaron

It is a local, single-node instance hosted in Docker. The bucket was created just for this test.
The node serves the Data, Eventing, Index, Query, and Search services.
The disk is an SSD.
And consider that when durability is set to None, I get results close to yours, so there should not be a problem with the installation. Of course, I cannot be sure.

@sedat-eyuboglu

You are asking the SDK to have durability without having multiple nodes to confirm that Durability was met. Have you tried this code on a cluster with multiple nodes?

-Aaron

No, but I set it to persist just to the active node. And as far as I know, vBuckets have just one active node; in a multi-node cluster there will be one active node for a given key. It is fine that keys will be distributed across multiple nodes, and if you are saying that having multiple nodes will balance the load, that is fine too. Let me test with multiple nodes.
I will try to create a multi-node environment and test it.

I’ll look into this - I wrote the code on a Mac using VS for Mac. Obviously, there is something weird going on with VS for PC, as I just tested it again in VS for Mac, Rider, and VS Code on my Mac and it runs on all three. I’ll see if I can get my PC booted up later tonight and debug this.

-Aaron

@biozal

Hi,
I tried the code on a cluster. I executed the following code with different values for MaxDegreeOfParallelism

async Task InsertParallelWithForEachAsync()
{
    Console.WriteLine("---InsertParallelWithForEachAsync");
    var cluster = await Cluster.ConnectAsync($"couchbase://****", options=>
    {
        options.WithCredentials("****", "****");
        // options.NumKvConnections = 32;
        // options.MaxKvConnections = 64;
        // options.KvSendQueueCapacity = 8096;
    });
    
    var collection = (await cluster.BucketAsync("****")).DefaultCollection();
    var itemList = System.Text.Json.JsonSerializer.Deserialize<List<ProductItem>>(System.IO.File.ReadAllText("product_100k.json"));

    var sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    await Parallel.ForEachAsync<ProductItem>(itemList, new ParallelOptions()
    {
        MaxDegreeOfParallelism = 36
    }, 
    async (item, token) =>
    {
        await collection.InsertAsync(Guid.NewGuid().ToString(), item, options =>
        {
            options.Timeout(TimeSpan.FromSeconds(30));
        });
    });
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

Durability is Majority and persist to active
100K documents
Server Enterprise Edition 6.6.0 build 7909

Results
MaxDegreeOfParallelism = 50: 369 sec. (SDK timeout option = 2.5 sec.)
MaxDegreeOfParallelism = 70: 470 sec.
MaxDegreeOfParallelism = 12: 500 sec.
MaxDegreeOfParallelism = 24: 390 sec.
MaxDegreeOfParallelism = 36: 392 sec.

Cluster

As you can see, the cluster was not empty and already had some workload other than this test.
The optimum options (for my environment) are the 2.5 sec. timeout (which is the default for the CB .NET SDK) and MaxDegreeOfParallelism = 50, which takes 369 sec to insert the 100K documents.
This result is ~3x better than my local single-node environment.

Do you think this is an acceptable result for such a cluster? I would like to hear your comments.
Thank you.

@sedat-eyuboglu,

I can’t comment on whether these performance numbers are good, because hardware and disk performance vary a lot. Again, there is a big difference between running a 3-node cluster on AWS’s smallest hardware and on its largest. Enterprise customers would normally ask the Solutions Architect assigned to their account to review and make recommendations on the cluster configuration, etc.

If you are looking for performance numbers to compare against your cluster, I normally compare my numbers to Pillow Fight numbers. Pillow Fight is a performance testing tool that comes with libcouchbase:

I would say there are some limits on Windows (it can only run on a single thread), but on Mac and Linux you can really ramp it up and look at the performance numbers of your cluster. Pillow Fight is written in C and it’s high-performing. I recommend running it and comparing your numbers - it should give you a good idea of what your cluster is capable of.
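
As a sketch, a run against a test bucket looks something like this (host, bucket, credentials, and the item/batch/thread counts are all placeholders to adjust for your setup):

cbc-pillowfight -U couchbase://localhost/Test -u Administrator -P password \
    -I 100000 -B 100 -t 4 --json --set-pct 100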

Thanks
Aaron

Thanks @biozal, I’ll be checking it out.