Querying and processing a large dataset

Hello,

I’ve been tasked with migrating a large number of records out of one datacenter and into another. We cannot do a direct transfer (using cbtransfer) since there are some validations that need to occur, so we need to run the data through our microservice via a web service. Here is what I want to do:

  1. Extract the existing data from the source Couchbase instance (running Couchbase 3.1.3) using cbtransfer.
  2. Load that data into a temporary bucket in one of our new data centers running Couchbase 4.5.
  3. Run an N1QL query to get all the required data, apply a transformation to each record (.map), and then POST each record to our web service to create it in the new data center.

The problem is step 3, since I am dealing with millions of records. I was under the impression that we could stream the N1QL result and that JVM memory usage would be controlled by backpressure. However, from what I can see in the docs, regular Observables do not apply backpressure. What are my options here? I read something about view queries and bulk operations, but I am hoping there is another option.

Thanks for your help,

-K

BTW, I am using the latest couchbase-java client.

Hi @k_reid, you can simply paginate, fetching N documents per batch, to control your backpressure. This isn’t the fastest way, but it does get the job done. It assumes the source data isn’t changing (or that you don’t mind changes happening while the transfer is in progress). See the sketch after the steps below.

1. Create a primary index on the source bucket.
2. SELECT META().id dockey, * FROM source_bucket WHERE META().id > "" ORDER BY META().id OFFSET 0 LIMIT 1000;
   2.1 INSERT all the docs into the target.
   2.2 Remember the last META().id you received.
3. SELECT META().id dockey, * FROM source_bucket WHERE META().id > "last-doc-key" ORDER BY META().id LIMIT 1000;
4. Repeat 2.1 through 3 until finished.
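Something like this, as a rough, untested sketch with the Java SDK 2.x blocking API. The bucket names, batch size, and the transform/POST step are placeholders for your own setup:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryRow;

import java.util.List;

public class BatchMigrator {

    private static final int BATCH_SIZE = 1000;

    public void migrate(Bucket source) {
        String lastKey = "";                        // keyset cursor, starts before all keys
        while (true) {
            String statement =
                "SELECT META(b).id AS dockey, b.* FROM `source_bucket` b " +
                "WHERE META(b).id > $lastKey ORDER BY META(b).id LIMIT " + BATCH_SIZE;

            List<N1qlQueryRow> rows = source.query(
                N1qlQuery.parameterized(statement,
                    JsonObject.create().put("lastKey", lastKey))).allRows();

            if (rows.isEmpty()) {
                break;                              // no more documents, we're done
            }
            for (N1qlQueryRow row : rows) {
                JsonObject doc = row.value();
                lastKey = doc.getString("dockey");  // remember the last key we received
                // validate / transform / POST to the web service here
            }
        }
    }
}
```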

Hi @keshav_m,

Yes, I thought about this approach after posting the question. I do have a time requirement, though, so can you suggest a quicker approach?

Thanks,

K

You can modify the query to fetch ONLY the META().id and then get each document directly from the KV service.

  1. SELECT META().id dockey FROM source_bucket WHERE META().id > "" ORDER BY META().id
     Then get the documents from KV, remembering the last META().id.

This is definitely faster. You can receive a LARGE set of document keys in order.

Similarly, after the transformation, you can simply set/insert the document on the target directly instead of issuing an INSERT query. Because the direct KV APIs avoid an extra hop, they’ll be much faster.
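A rough sketch of that keys-only variant (untested; bucket names and the batch size are placeholders, and the transform step is left to you):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryRow;

public class KeysThenKv {

    public void copyBatch(Bucket source, Bucket target, String lastKey, int batchSize) {
        // The query returns only document keys; the documents themselves come from KV.
        String statement =
            "SELECT META(b).id AS dockey FROM `source_bucket` b " +
            "WHERE META(b).id > $lastKey ORDER BY META(b).id LIMIT " + batchSize;

        for (N1qlQueryRow row : source.query(
                N1qlQuery.parameterized(statement,
                    JsonObject.create().put("lastKey", lastKey))).allRows()) {

            String key = row.value().getString("dockey");

            JsonDocument doc = source.get(key);     // direct KV read, no query-service hop
            if (doc == null) {
                continue;                           // document vanished between query and get
            }
            JsonObject transformed = doc.content(); // apply your transformation here
            target.upsert(JsonDocument.create(key, transformed)); // direct KV write
        }
    }
}
```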

Thanks @keshav_m, but I actually went the streaming route. After some more research it turns out I can take advantage of reactive pull backpressure while streaming. I will post my findings after I run a large dataset.
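For anyone curious, this is roughly the shape of what I’m trying, assuming the async N1QL row Observable honours reactive pull backpressure (still verifying that against the SDK docs). The request sizes and the process() call are placeholders:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.query.AsyncN1qlQueryResult;
import com.couchbase.client.java.query.AsyncN1qlQueryRow;
import com.couchbase.client.java.query.N1qlQuery;
import rx.Subscriber;

public class StreamingMigrator {

    public void migrate(Bucket source) {
        source.async()
              .query(N1qlQuery.simple("SELECT META(b).id, b.* FROM `source_bucket` b"))
              .flatMap(AsyncN1qlQueryResult::rows)
              .subscribe(new Subscriber<AsyncN1qlQueryRow>() {
                  @Override
                  public void onStart() {
                      request(100);                 // ask for a small initial batch only
                  }

                  @Override
                  public void onNext(AsyncN1qlQueryRow row) {
                      process(row);                 // transform + POST, placeholder
                      request(1);                   // pull the next row when we're ready
                  }

                  @Override
                  public void onError(Throwable t) {
                      t.printStackTrace();
                  }

                  @Override
                  public void onCompleted() {
                      System.out.println("Streaming migration finished");
                  }
              });
    }

    private void process(AsyncN1qlQueryRow row) {
        // row.value() is the JsonObject for this result row
    }
}
```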

@k_reid, that’s great. Please contact @czajkowski about writing an article on this for the Couchbase blog. She’ll award you a prize!

Hi, we’d love to publish your article. You can find out more about how to get paid to write blog posts about your use of N1QL and Couchbase here: https://blog.couchbase.com/community-writing-program/
Thanks,

Laura