Querying and processing a large dataset

Hello,

I’ve been tasked with migrating a large number of records out of one datacenter and into another. We cannot do a direct transfer (using cbtransfer) since there are some validations that need to occur, so we need to run the data through our microservice via a web service. Here is what I want to do:

  1. Extract the existing data from the source Couchbase instance (running Couchbase 3.1.3) using cbtransfer.
  2. Load that data into a temporary bucket in one of our new data centers running Couchbase 4.5.
  3. Run an N1QL query to get all the required data, apply a transformation to each record (.map), and then POST each record to our web service to create it in the new data center.

The problem is step 3, since I am dealing with millions of records. I was under the impression that we could stream the N1QL result and that JVM memory usage would be controlled by backpressure. However, from what I can see in the docs, regular Observables do not apply backpressure. What are my options here? I read something about view queries and bulk operations, but I am hoping there is another option.

Thanks for your help,

-K

BTW, I am using the latest couchbase-java client.

Hi @k_reid, you can simply paginate, fetching N documents per batch, to control your backpressure. This isn’t the fastest way, but it does get the job done. It assumes the source data isn’t changing (or that you don’t mind changes happening while the transfer is in progress). See the sketch after the steps below.

1. Create a primary index on the source bucket.
2. SELECT META().id dockey, * FROM source_bucket WHERE META().id > "" ORDER BY META().id OFFSET 0 LIMIT 1000;
   2.1 INSERT all the docs into the target.
   2.2 Remember the last META().id you received.
3. SELECT META().id dockey, * FROM source_bucket WHERE META().id > "last-doc-key" ORDER BY META().id LIMIT 1000;
4. Repeat 2.1 through 3 until finished.
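Something like this, as a rough, untested sketch with the Java SDK 2.x blocking API. The bucket names, batch size, and the transform/POST step are placeholders for your own setup:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryRow;

import java.util.List;

public class BatchMigrator {

    private static final int BATCH_SIZE = 1000;

    public void migrate(Bucket source) {
        String lastKey = "";                        // keyset cursor, starts before all keys
        while (true) {
            String statement =
                "SELECT META(b).id AS dockey, b.* FROM `source_bucket` b " +
                "WHERE META(b).id > $lastKey ORDER BY META(b).id LIMIT " + BATCH_SIZE;

            List<N1qlQueryRow> rows = source.query(
                N1qlQuery.parameterized(statement,
                    JsonObject.create().put("lastKey", lastKey))).allRows();

            if (rows.isEmpty()) {
                break;                              // no more documents, we're done
            }
            for (N1qlQueryRow row : rows) {
                JsonObject doc = row.value();
                lastKey = doc.getString("dockey");  // remember the last key we received
                // validate / transform / POST to the web service here
            }
        }
    }
}
```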

Hi @keshav_m,

Yes, I thought about this approach after posting the question. I do have a time requirement, though, so can you suggest a quicker approach?

Thanks,

K

You can modify the query to fetch ONLY the META().id and then get each document directly from the KV service.

  1. SELECT META().id dockey FROM source_bucket WHERE META().id > "" ORDER BY META().id
     Then get the documents from KV, remembering the last META().id.

This is definitely faster. You can receive a LARGE set of document keys in order.

Similarly, after the transformation, you can simply set/insert the document on the target directly instead of issuing an INSERT query. Because the direct KV APIs avoid an extra hop, they’ll be much faster.
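A rough sketch of that keys-only variant (untested; bucket names and the batch size are placeholders, and the transform step is left to you):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryRow;

public class KeysThenKv {

    public void copyBatch(Bucket source, Bucket target, String lastKey, int batchSize) {
        // The query returns only document keys; the documents themselves come from KV.
        String statement =
            "SELECT META(b).id AS dockey FROM `source_bucket` b " +
            "WHERE META(b).id > $lastKey ORDER BY META(b).id LIMIT " + batchSize;

        for (N1qlQueryRow row : source.query(
                N1qlQuery.parameterized(statement,
                    JsonObject.create().put("lastKey", lastKey))).allRows()) {

            String key = row.value().getString("dockey");

            JsonDocument doc = source.get(key);     // direct KV read, no query-service hop
            if (doc == null) {
                continue;                           // document vanished between query and get
            }
            JsonObject transformed = doc.content(); // apply your transformation here
            target.upsert(JsonDocument.create(key, transformed)); // direct KV write
        }
    }
}
```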

Thanks @keshav_m, but I actually went the streaming route. After some more research it turns out I can take advantage of reactive pull backpressure while streaming. I will post my findings after I run a large dataset.
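For anyone curious, this is roughly the shape of what I’m trying, assuming the async N1QL row Observable honours reactive pull backpressure (still verifying that against the SDK docs). The request sizes and the process() call are placeholders:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.query.AsyncN1qlQueryResult;
import com.couchbase.client.java.query.AsyncN1qlQueryRow;
import com.couchbase.client.java.query.N1qlQuery;
import rx.Subscriber;

public class StreamingMigrator {

    public void migrate(Bucket source) {
        source.async()
              .query(N1qlQuery.simple("SELECT META(b).id, b.* FROM `source_bucket` b"))
              .flatMap(AsyncN1qlQueryResult::rows)
              .subscribe(new Subscriber<AsyncN1qlQueryRow>() {
                  @Override
                  public void onStart() {
                      request(100);                 // ask for a small initial batch only
                  }

                  @Override
                  public void onNext(AsyncN1qlQueryRow row) {
                      process(row);                 // transform + POST, placeholder
                      request(1);                   // pull the next row when we're ready
                  }

                  @Override
                  public void onError(Throwable t) {
                      t.printStackTrace();
                  }

                  @Override
                  public void onCompleted() {
                      System.out.println("Streaming migration finished");
                  }
              });
    }

    private void process(AsyncN1qlQueryRow row) {
        // row.value() is the JsonObject for this result row
    }
}
```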

@k_reid, that’s great. Please contact @czajkowski about writing an article on this for the Couchbase blog. She’ll award you a prize!

Hi, we’d love to publish your article. You can find out more about how to get paid to write blog posts about your use of N1QL and Couchbase here: https://blog.couchbase.com/community-writing-program/
Thanks,

Laura