Leveraging PySpark to write to Couchbase

Hi,

I have a JSON file with around 6 million entries. I am using the Java SDK to write to Couchbase, but writing one entry at a time is taking hours. I am fairly new to Couchbase and want to know the best way to speed up the writes. Would leveraging Spark help in this case? Do you have any sample programs for PySpark and Couchbase? Any other suggestions would also be really great.

If the source data is already valid JSON as indicated, you may be able to use the cbimport tool to more efficiently load it.

Ref: cbimport | Couchbase Docs

I’d typically use something along the lines of:

cbimport json --format list -c http://<host>:8091 -u <user> -p <pwd> -b <bucket> -g "#UUID#" -d file:///path/to/data.json

Where I want the document keys to be generated as UUIDs, and the data is a JSON array of objects (“list” format), e.g.

[
  {"field":"value","another":"value"},
  {"field":"value","another":"value"},
  ...
]

HTH.

Hi @dh

Thank you for the reply. My data gets updated frequently (every week), and I have some other data which gets updated on a daily basis. So I need to push the data into Couchbase as part of a script instead of using cbimport to pull it in. Also, the records need to be updated/overwritten.

The SDK documentation on batching operations is probably a good place for you to start: ingest and submit a batch of JSON documents from your file at a time, rather than one at a time.
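
Since the records need to be overwritten, upsert is the operation you want. Here is a minimal sketch of batched upserts with the Couchbase Python SDK (4.x); the bucket name, batch size, and the "id" field used as the document key are all assumptions, so adjust them to your data and environment (if your SDK version lacks upsert_multi, a plain loop over upsert inside each batch works too):

# A minimal sketch of batched upserts with the Couchbase Python SDK (4.x).
# The bucket name, batch size, and the "id" field used as the document key
# are assumptions -- adjust them to your data and environment.
import json

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)
collection = cluster.bucket("my-bucket").default_collection()

BATCH_SIZE = 1000  # tune for your environment

with open("/path/to/data.json") as f:
    docs = json.load(f)  # assumes the "list" format shown earlier

for start in range(0, len(docs), BATCH_SIZE):
    batch = docs[start:start + BATCH_SIZE]
    # upsert overwrites existing documents, which covers the
    # weekly/daily refresh; upsert_multi sends the batch concurrently
    collection.upsert_multi({doc["id"]: doc for doc in batch})

And since you asked about PySpark: there is an official Couchbase Spark Connector, but it is JVM-based. A simple pattern that works from plain PySpark is to fan the writes out with foreachPartition, opening one SDK connection per partition. Again, the connection details and the "id" key field are assumptions, not something from your setup:

# A hedged PySpark sketch: distribute the writes across executors with
# foreachPartition, opening one SDK connection per partition. The
# connection details and "id" key field are again assumptions.
from pyspark.sql import SparkSession


def write_partition(rows):
    # Import inside the function so it runs on the executors
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    cluster = Cluster(
        "couchbase://localhost",
        ClusterOptions(PasswordAuthenticator("user", "password")),
    )
    collection = cluster.bucket("my-bucket").default_collection()
    for row in rows:
        doc = row.asDict(recursive=True)
        collection.upsert(doc["id"], doc)


spark = SparkSession.builder.appName("couchbase-load").getOrCreate()
# Spark reads JSON Lines by default; add .option("multiLine", True)
# if the file is one big JSON array as in the cbimport example above
df = spark.read.json("/path/to/data.json")
df.foreachPartition(write_partition)

Either way, the win comes from keeping many mutations in flight at once rather than waiting on each round trip, so batched upserts should be dramatically faster than sequential single-document writes.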

HTH.