Bulk copy of existing documents

I have a use case where, once a month, I need to copy a set of documents to a new set of documents with slightly modified properties. This can be accomplished easily for a small set using an INSERT INTO … SELECT pattern. However, for large sets, running this statement causes the indexing service to get overloaded. I don’t mind the process taking a while, and there are probably several ways to solve this, but I’m wondering what the best approach might be using only Couchbase Server EE and/or the .NET SDK.
If it will take an extended period of time, having feedback on the progress (for the purpose of updating a UI) would be very helpful too.
Are there any recommendations for this? The set may be as large as 1 million (small) documents.

  • Server EE, 6.6
  • .NET SDK 3.1
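
For reference, the statement I run today looks something like this (the bucket names, property, and filter below are placeholders, not my real schema):

INSERT INTO `target` (KEY _k, VALUE _v)
SELECT META(d).id || "::copy" AS _k,
       OBJECT_PUT(d, "someProp", "newValue") AS _v
FROM `source` AS d
WHERE d.type = "someType";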

Hi @malclear ,

Will you be able to post your INSERT INTO … SELECT query? If your SELECT filter is simple, Eventing might be an option. cc @jon.strabala

Check out Use Case 1:

https://docs.couchbase.com/server/current/eventing/eventing-example-data-enrichment.html

Hi @malclear,

Without knowing exactly what you want to modify or add, I propose a simple Eventing Function that can be deployed once a month. Note, you can use the REST API and cron to automate the deployment and eventual undeployment.

Eventing Functions can process up to 1M docs per second depending on your cluster. Even small clusters can easily hit 50K docs/sec.

Below I show an update in place. If you want to copy the doc, you can modify the write key on the last line to something like src_bkt[meta.id + ":monthly"] = doc;. If you want to send the copy to a different bucket, change the alias to dst_bkt and create a binding to the desired target bucket in the Function’s settings (see the sketch after the code below).

/*
Requires one bucket binding or alias
1. src_bkt is a r+w alias to the Function's source bucket
Note, for more performance set the # of workers to the # of cores
*/
function OnUpdate(doc, meta) {
    // filter out all keys that don't start with "workingtypes"
    if (! meta.id.startsWith("workingtypes")) return;
    // modify and/or add something
    doc.monthly_prop_to_modify = Math.random();
    // update the doc in place in the source bucket
    src_bkt[meta.id] = doc;
}
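
For the send-to-a-different-bucket variant just mentioned, a minimal sketch (it assumes a second binding, with an alias of dst_bkt in read+write mode, to the desired target bucket):

function OnUpdate(doc, meta) {
    // filter out all keys that don't start with "workingtypes"
    if (! meta.id.startsWith("workingtypes")) return;
    // modify and/or add something
    doc.monthly_prop_to_modify = Math.random();
    // write the modified copy to the target bucket via the dst_bkt alias
    dst_bkt[meta.id + ":monthly"] = doc;
}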

You have the full power of JavaScript and can make decisions based on the document’s properties. For example, I could put in a test or filter to see if the doc has already been updated this month and just return.
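
A minimal sketch of that guard, assuming a hypothetical bookkeeping property last_monthly_run stamped on each processed doc:

function OnUpdate(doc, meta) {
    if (! meta.id.startsWith("workingtypes")) return;
    // "last_monthly_run" is a hypothetical property holding e.g. "2021-06"
    var thisMonth = new Date().toISOString().slice(0, 7);
    if (doc.last_monthly_run === thisMonth) return; // already done this month
    doc.last_monthly_run = thisMonth;
    doc.monthly_prop_to_modify = Math.random();
    src_bkt[meta.id] = doc;
}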

The nice thing is that Eventing scales both vertically (add more workers) and horizontally (add more Eventing nodes), and you don’t have to deal with any special infrastructure, since you are running a pure code fragment.
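
For instance, the vertical knob is the Function’s worker count; in the handler’s settings it shows up as something like the fragment below (the field name worker_count is from memory, and the value is purely illustrative):

{
  "worker_count": 8
}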

Best

Jon Strabala
Principal Product Manager - Server‌


I’m trying something like what you suggested above, @jon.strabala, but I’m running into a problem. Here is a code sample:

function DuplicateStuff(oldSegment, newSegment){
	var myDetails = SELECT * FROM `spaghetti`  WHERE type = 'MyDetail' and segment = $oldSegment;
	    
	for(var item of myDetails){
	    var newItem = item;
	    newItem.segmentId = newSegment;
	    rcm["myDetail/" + newItem.id +"-"+ newSegment] = newItem;
	}
}

The “DuplicateStuff” function is called from the “OnUpdate” function.
The problem is that it breaks when I get to the for loop. Do you see any problem with copying large sets of documents this way?

The error is rather cryptic. It seems null reference related, but I’m not sure:

MyFunctionsName.js:48
D
-ffef-ff9f
^
null
(errors: 1)

Hi @malclear,

I had hoped, from my initial understanding, that you would avoid N1QL entirely and do all your logic in Eventing’s JavaScript, but alas I don’t know the business logic requirements, nor do I have:

  1. The entire function
  2. Any sample data or docs

You are not crazy (yes, that is not a helpful error); sometimes the error reporting is cryptic, and every release improves it. To debug, you can add some log() statements and also use a try/catch with a log() of the error, if any; the try/catch log() should give an understandable message. You can also use the JavaScript debugger built into Chrome DevTools if needed and step through your code.

I also took the liberty of doing a bit of refactoring:

  1. I sub-selected the “item” from your iterator: var newItem = item.spaghetti;
  2. I closed your iterator: myDetails.close();
  3. I added a try/catch block

// 1. bucket spaghetti
// 2. bucket meta
// 3. bucket binding to bucket "spaghetti" with an alias of "rcm" in read+write mode
//
// * sample KEY: MyDetail:10000  DOC: { "id": "1000", "type": "MyDetail", "segment": "old_s" }

function DuplicateStuff(oldSegment, newSegment){
    log('1a oldSegment',oldSegment);
    log('1b newSegment',newSegment);
    try {
        var myDetails = SELECT * FROM `spaghetti` WHERE type = 'MyDetail' AND segment = $oldSegment;
        var idx = 0;
        for (var item of myDetails) {
            idx++;
            log('2a loop '+idx+' item',item);
            // careful: you have a wildcard '*' in your SELECT, so each item is wrapped as "spaghetti": { .... }
            var newItem = item.spaghetti;
            log('2b loop '+idx+' newItem',newItem);
            newItem.segmentId = newSegment;
            // need a bucket binding with an alias of rcm for the below to work
            var newKey = "myDetail/" + newItem.id +"-"+ newSegment;
            log('2c loop '+idx+' newKey',newKey,' set to ', newItem);
            rcm[newKey] = newItem;
        }
        // remember to close your iterator in 6.6
        myDetails.close();
        log('3 SUCCESS');
    } catch (e) {
        log('3 ERROR', e);
    }
}

function OnUpdate(doc, meta) {
    // FILTER: only process source keys with the prefix "MyDetail:" so we
    // never re-process the copies we write under the "myDetail/" prefix
    if (! meta.id.startsWith("MyDetail:")) return;
    DuplicateStuff(doc.segment, 10);
}

You also need an index to use N1QL. Basically, a primary index only lets you find KEYs, so any N1QL DML statement would then always have to fetch each document to do its filtering; this is much less efficient than a purpose-built index that satisfies the query without needing to dip into KV.
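
For the query in DuplicateStuff above, a purpose-built secondary index might look like the following (the index name here is my own invention):

CREATE INDEX ix_spaghetti_type_segment ON `spaghetti`(`type`, `segment`);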

For a much deeper dive into the reasons, please read the excellent blog post by @keshav_m: Primary Uses for Couchbase Primary Index | The Couchbase Blog

You can just run the following in the Query Workbench (warning: it is fine for small toys, BUT it is not as efficient as a purpose-built index would be - see above). The statement below is for pre-7.0.0, non-collections deployments. Of course, you most likely have an index already.

CREATE PRIMARY INDEX ON spaghetti;

Without any index you will get an error like:

ERROR" {"message":"SDK error : LCB_ERR_PLANNING_FAILURE (401) Query error : {\n\"requestID\": \"b0232a8d-d92d-4f0a-b776-9d8a5a258d0f\",\n\"clientContextID\": \"12@bulk.js(DuplicateStuff)\",\n\"errors\": [{\"code\":4000,\"msg\":\"No index available on keyspace `default`:`spaghetti` that matches your query. Use CREATE PRIMARY INDEX ON `default`:`spaghetti` to create a primary index, or check that your expected index is online.\"}],\n\"status\": \"fatal\",\n\"metrics\": {\"elapsedTime\": \"1.039847ms\",\"executionTime\": \"986.75µs\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 1,\"errorCount\": 1}\n}\n","stack":"Error\n    at DuplicateStuff (bulk.js:12:22)\n    at OnUpdate (bulk.js:34:5)"}

I deployed the above function with no documents in spaghetti, and then added one (1) document as follows:

MyDetail:10000  
{
  "id": "1000",
  "type": "MyDetail",
  "segment": "old_s"
}

The output (reversed in time) from the Log button in the Eventing UI is below:

2021-06-23T17:22:05.783-07:00 [INFO] "3 SUCCESS" 
2021-06-23T17:22:05.783-07:00 [INFO] "2c loop 2 newKey" "myDetail/1000-10" " set to " {"id":"1000","type":"MyDetail","segment":"old_s","segmentId":10} 
2021-06-23T17:22:05.783-07:00 [INFO] "2b loop 2 newItem" {"id":"1000","type":"MyDetail","segment":"old_s","segmentId":10} 
2021-06-23T17:22:05.782-07:00 [INFO] "2c loop 1 newKey" "myDetail/1000-10" " set to " {"id":"1000","type":"MyDetail","segment":"old_s","segmentId":10} 
2021-06-23T17:22:05.782-07:00 [INFO] "2b loop 1 newItem" {"id":"1000","type":"MyDetail","segment":"old_s"} 
2021-06-23T17:22:05.782-07:00 [INFO] "2a loop 2 item" {"spaghetti":{"id":"1000","type":"MyDetail","segment":"old_s","segmentId":10}} 
2021-06-23T17:22:05.782-07:00 [INFO] "2a loop 1 item" {"spaghetti":{"id":"1000","type":"MyDetail","segment":"old_s"}} 
2021-06-23T17:22:05.764-07:00 [INFO] "1b newSegment" 10 
2021-06-23T17:22:05.764-07:00 [INFO] "1a oldSegment" "old_s"

And we get a new KEY/DOC as follows:

myDetail/1000-10
{
  "id": "1000",
  "type": "MyDetail",
  "segment": "old_s",
  "segmentId": 10
}

Okay, in this example I didn’t need to run N1QL at all (like I said, I don’t know your business logic; you might need N1QL). But in my mock-up, from my limited knowledge, I could have just done the following without any N1QL:

// 1. bucket spaghetti
// 2. bucket meta
// 3. bucket binding to bucket "spaghetti" with an alias of "rcm" in read+write mode
//
// * sample KEY: MyDetail:10000  DOC: { "id": "1000", "type": "MyDetail", "segment": "old_s" }

function DuplicateStuff(doc, oldSegment, newSegment){
    log('1a oldSegment',oldSegment);
    log('1b newSegment',newSegment);
    try {
        var newKey = "myDetail/" + doc.id +"-"+ newSegment;
        doc.segmentId = newSegment;
        rcm[newKey] = doc;
        log('3 SUCCESS');
    } catch (e) {
        log('3 ERROR', e);
    }
}

function OnUpdate(doc, meta) {
    // FILTER: only process source keys with the prefix "MyDetail:" so we
    // never re-process the copies we write under the "myDetail/" prefix
    if (! meta.id.startsWith("MyDetail:")) return;
    DuplicateStuff(doc, doc.segment, 10);
}

The second, non-N1QL function is so simple it could just all be done in OnUpdate(); of course, my log statements and refactoring clutter things up a bit and can be removed. So in the third function below I am back to what I proposed at first:

// 1. bucket spaghetti
// 2. bucket meta
// 3. bucket binding to bucket "spaghetti" with an alias of "rcm" in read+write mode
//
// * sample KEY: MyDetail:10000  DOC: { "id": "1000", "type": "MyDetail", "segment": "old_s" }

function OnUpdate(doc,meta) {
    // FILTER: only process source keys with the prefix "MyDetail:" so we
    // never re-process the copies we write under the "myDetail/" prefix
    if (! meta.id.startsWith("MyDetail:")) return;
    try {
        var newSegment = 10;  // no idea where this comes from
        // add a new property to the working copy of doc
        doc.segmentId = newSegment;
        // save it back to the source bucket via the alias "rcm"
        rcm["myDetail/" + doc.id +"-"+ newSegment] = doc;
    } catch (e) {
        log('KEY',meta.id,'ERROR', e);
    }
}

When you drop down from pure KV in Eventing to N1QL queries you will slow down, and at that point I doubt you will get much of a speedup from using Eventing.

Eventing relies on DCP, which will replay ALL documents/mutations through your Eventing function (you are guaranteed to see the latest state). So if you deploy your Function with the feed boundary set to “Everything” you will see all the documents; thus, querying documents from the same source bucket via N1QL is an expensive operation for something that you would see anyway from the DCP stream.

Best

Jon Strabala
Principal Product Manager - Server‌

Thank you so much for the in-depth response! I had every code-related thing you mentioned in place (in my actual code which I can’t share here), but the additional logging you suggested did help me find the problem: my script was timing out. A more thorough review of the settings available for Eventing functions was what I needed.
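
For anyone else who hits this: the culprit in my case was the Function’s script timeout. In the handler’s settings this surfaces as something like the fragment below (the field name execution_timeout is from memory; the value, in seconds, is just an example):

{
  "execution_timeout": 600
}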

It would be great if the error raised due to a timeout was more prominent somehow.

Hi @malclear

It would be great if the error raised due to a timeout was more prominent somehow.

We should have this in the next release.