Large scale query-multithreading and batching

krishnaa · March 16, 2021, 1:11pm

We have to set up a batch process that runs a query to delete a specific portion of Couchbase docs (more than 35k docs at once). We have been told to use multithreading and the 3.x version of the .NET SDK. I had read that batch processing is supported only in 2.7. Can anyone suggest any pointers here for multithreading and batch processing?

btburnett3 · March 16, 2021, 2:39pm

@krishnaa

It is supported, but via async Tasks, in 3.x. Basically, just run off a bunch of tasks and then await Task.WhenAll(tasks).

However, this approach does add a few complexities. For example, exceptions if any of the tasks fail are difficult to extract clearly. Also, if the batch size is very large (as yours is) it can actually cause performance issues if you don’t control the degree of parallelism.

I’ve been working on a library to help address this. It’s on GitHub, but not published to NuGet yet. I’d love it if you wanted to give it a try and give feedback on performance and API surface. Implement Couchbase.Extensions.MultiOp by brantburnett · Pull Request #92 · couchbaselabs/Couchbase.Extensions · GitHub

You should be able to pull the source from that branch and build it. Instructions for use are here: Couchbase.Extensions/docs/multi-op.md at 95b12bfcb0fee84281ae9c556d429c9c133185d0 · couchbaselabs/Couchbase.Extensions · GitHub
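
For readers who want to see the plain Task.WhenAll approach mentioned above, here is a minimal sketch that caps the degree of parallelism with a SemaphoreSlim and collects failures per key instead of letting them surface as one aggregate exception. It uses only the .NET base class library. The names BoundedBatch, RunAsync and maxParallel are hypothetical and are not the API of the MultiOp library linked above; the operation delegate is a stand-in for whatever per-document SDK call applies.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not part of any Couchbase package.
public static class BoundedBatch
{
    // Runs `operation` once per key with at most `maxParallel` calls in flight.
    // Returns the keys that failed, each paired with the exception it threw.
    public static async Task<IReadOnlyList<(string Key, Exception Error)>> RunAsync(
        IEnumerable<string> keys,
        Func<string, Task> operation,
        int maxParallel)
    {
        using var gate = new SemaphoreSlim(maxParallel);
        var failures = new ConcurrentBag<(string Key, Exception Error)>();

        // ToList() starts every task immediately; the semaphore decides
        // how many of them are actually executing `operation` at once.
        var tasks = keys.Select(async key =>
        {
            await gate.WaitAsync();
            try
            {
                await operation(key);
            }
            catch (Exception ex)
            {
                // Record the failure against its key rather than rethrowing.
                failures.Add((key, ex));
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
        return failures.ToList();
    }
}

public static class Program
{
    public static async Task Main()
    {
        // Made-up keys; in practice these would come from a query or a key list.
        var keys = Enumerable.Range(0, 1000).Select(i => $"doc::{i}");

        var failures = await BoundedBatch.RunAsync(keys, async key =>
        {
            // Stand-in for a per-document SDK call.
            await Task.Delay(1);
            if (key == "doc::42")
                throw new InvalidOperationException("simulated failure");
        }, maxParallel: 16);

        Console.WriteLine($"{failures.Count} operation(s) failed");
    }
}
```

Because each task handles its own key, no two tasks touch the same document as long as the key list has no duplicates. The right value for maxParallel depends on the cluster and the workload, so it is worth measuring rather than guessing.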

krishnaa · March 18, 2021, 12:55pm

@btburnett3
My requirement is to delete a specific node from all the Couchbase documents that contain it. The number of documents containing that node might vary from 10 to 35k.
I can fetch the count of documents from which the specific node has to be deleted.

So how can I use await Task.WhenAll(tasks) to run a bunch of tasks? How can I make sure the first task is not overwriting or repeating the work done by the second task? I will not have the list of document IDs from which the node has to be deleted, and the document count is huge.

btburnett3 · March 18, 2021, 1:45pm

If you don’t have the list of document keys what’s your plan to know which documents to mutate? Are you just going through all documents in the bucket and inspecting them one by one?

krishnaa · March 18, 2021, 1:49pm

@btburnett3
Currently I am doing it via this N1QL query, and I have added an index for the nested array:
UPDATE Test AS d
SET d.children = ARRAY l FOR l IN d.children WHEN l.childId != "123456" END
WHERE ANY v IN d.children SATISFIES v.childId = "123456" END;

CREATE INDEX ix1 ON Test (DISTINCT ARRAY v.childId FOR v IN children END);

But I wanted a better way of achieving this that improves performance.

btburnett3 · March 18, 2021, 5:00pm

What’s the performance like running that query just as a SELECT query to get the document keys? If you had that list, then you could use the keys to spool off the mutations.

That said, depending on your goals, I’m not sure if it would improve performance. That’s basically what the query node is already doing for you. Seems like your major problem isn’t the mutations, but how you’re identifying documents which need to be mutated.
