Leveraging PySpark to write to Couchbase

Hi,

I have a JSON file with around 6 million entries. I am using the Java SDK to write to Couchbase, but writing one entry at a time is taking hours. I am fairly new to Couchbase and want to know the best way to speed up the writes. Would leveraging Spark help in this case? Do you have any sample programs for PySpark and Couchbase? Any other suggestions would also be really great.

If the source data is already valid JSON as indicated, you may be able to use the cbimport tool to more efficiently load it.

Ref: cbimport | Couchbase Docs

I’d typically use something along the lines of:

cbimport json --format list -c http://<host>:8091 -u <user> -p <pwd> -b <bucket> -g "#UUID#" -d file:///path/to/data.json

Where I want the document keys to be generated as UUIDs, and the data is a JSON array of objects (“list” format), e.g.

[
  {"field":"value","another":"value"},
  {"field":"value","another":"value"},
  ...
]

HTH.

Hi @dh

Thank you for the reply. My data gets updated frequently (every week), and I have some other data which gets updated on a daily basis. So I need to push the data into Couchbase as part of a script instead of using cbimport to pull it in. Also, the records need to be updated/overwritten.

The SDK documentation on batching operations is probably a good place for you to start: ingest and submit a batch of JSON documents from your file at a time, rather than one at a time.
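
Since the records need to be overwritten, upsert is the operation you want. Here is a minimal sketch of batched upserts with the Couchbase Python SDK (4.x); the bucket name, batch size, and the "id" field used as the document key are all assumptions, so adjust them to your data and environment (if your SDK version lacks upsert_multi, a plain loop over upsert inside each batch works too):

# A minimal sketch of batched upserts with the Couchbase Python SDK (4.x).
# The bucket name, batch size, and the "id" field used as the document key
# are assumptions -- adjust them to your data and environment.
import json

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)
collection = cluster.bucket("my-bucket").default_collection()

BATCH_SIZE = 1000  # tune for your environment

with open("/path/to/data.json") as f:
    docs = json.load(f)  # assumes the "list" format shown earlier

for start in range(0, len(docs), BATCH_SIZE):
    batch = docs[start:start + BATCH_SIZE]
    # upsert overwrites existing documents, which covers the
    # weekly/daily refresh; upsert_multi sends the batch concurrently
    collection.upsert_multi({doc["id"]: doc for doc in batch})

And since you asked about PySpark: there is an official Couchbase Spark Connector, but it is JVM-based. A simple pattern that works from plain PySpark is to fan the writes out with foreachPartition, opening one SDK connection per partition. Again, the connection details and the "id" key field are assumptions, not something from your setup:

# A hedged PySpark sketch: distribute the writes across executors with
# foreachPartition, opening one SDK connection per partition. The
# connection details and "id" key field are again assumptions.
from pyspark.sql import SparkSession


def write_partition(rows):
    # Import inside the function so it runs on the executors
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    cluster = Cluster(
        "couchbase://localhost",
        ClusterOptions(PasswordAuthenticator("user", "password")),
    )
    collection = cluster.bucket("my-bucket").default_collection()
    for row in rows:
        doc = row.asDict(recursive=True)
        collection.upsert(doc["id"], doc)


spark = SparkSession.builder.appName("couchbase-load").getOrCreate()
# Spark reads JSON Lines by default; add .option("multiLine", True)
# if the file is one big JSON array as in the cbimport example above
df = spark.read.json("/path/to/data.json")
df.foreachPartition(write_partition)

Either way, the win comes from keeping many mutations in flight at once rather than waiting on each round trip, so batched upserts should be dramatically faster than sequential single-document writes.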

HTH.