Batch reading data - controlling vbuckets

marcin.szymaniuk · June 21, 2018, 9:08am

I would like to read a significant portion of the data in Couchbase with Spark. I want to do the read in a reasonable way: instead of issuing millions of random reads, I would like to control which vBucket/server/file I’m reading the data from. I know there is a Spark connector, but there are many complaints that it lacks this kind of control, which causes bad performance.
So my question is: does Couchbase allow operations like:

reading an entire vBucket (preferable, since I believe an entire vBucket is not only on the same server but also in a single place storage-wise),
or:
any other way of using indexes that allows efficient batch reads (e.g. a clustered index in a relational database allows reading data by range efficiently).

Thanks,
Marcin

matthew.groves · June 21, 2018, 2:46pm

Hi @marcin.szymaniuk,

I don’t think there is any interface (at least no public one) that allows you to do vBucket-centric operations.
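
For context on what a vBucket is from the client's point of view: as far as I know, the SDKs decide which vBucket a document belongs to by hashing its key with CRC32. The sketch below assumes that scheme and a default of 1024 vBuckets (some platforms use fewer). The function names are hypothetical, and it only groups keys you already know by vBucket; it does not read a vBucket from the server.

```python
import zlib
from collections import defaultdict

# Assumption: 1024 vBuckets per bucket (the usual default; check your cluster).
NUM_VBUCKETS = 1024


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Return the vBucket id for a document key, assuming CRC32-based mapping."""
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    # Assumed scheme: take bits 16..30 of the CRC32, then reduce to the vBucket count.
    return ((crc >> 16) & 0x7FFF) % num_vbuckets


def group_keys_by_vbucket(keys, num_vbuckets: int = NUM_VBUCKETS):
    """Group known document keys by the vBucket they would map to."""
    groups = defaultdict(list)
    for key in keys:
        groups[vbucket_for_key(key, num_vbuckets)].append(key)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    for vb, ks in sorted(group_keys_by_vbucket(["user::1", "user::2", "order::17"]).items()):
        print(vb, ks)
```

Grouping keys like this could let a Spark job batch its gets per vBucket, but it still requires knowing the keys up front.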

Are the Spark connector complaints yours? You may want to start a separate thread in the Spark connector forum to address them (Spark Connector - Couchbase Forums), or at least make @tyler.mitchell aware of them (if he isn’t already).

The closest I can think of to what you’re asking is the new index partitioning feature in Couchbase Server 5.5. This allows you to do “partition elimination” when executing N1QL queries. This means you can narrow your index down to a single partition (kinda like a vbucket, but for indexes). But it will still be gathering documents from multiple vbuckets. More information here: Index Partitioning in Couchbase Server 5.5 - The Couchbase Blog
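
To make the partition elimination idea concrete, here is a minimal sketch that only builds the N1QL text: a `CREATE INDEX ... PARTITION BY HASH(...)` statement and a query with an equality predicate on the partition key, which is the kind of query that can be narrowed to one index partition. The helper names and the bucket, index, and field names (`travel`, `idx_country`, `country`) are hypothetical, and the statements still have to be run through whatever N1QL client you use.

```python
def create_partitioned_index(index_name, bucket, partition_field, indexed_fields):
    """Build a CREATE INDEX statement that hash-partitions the index on one field."""
    fields = ", ".join(f"`{f}`" for f in indexed_fields)
    return (
        f"CREATE INDEX `{index_name}` ON `{bucket}`({fields}) "
        f"PARTITION BY HASH(`{partition_field}`)"
    )


def partition_scoped_query(bucket, partition_field):
    """Build a query with an equality predicate on the partition key.

    $value is a named parameter to be supplied when the query is executed.
    """
    return f"SELECT b.* FROM `{bucket}` AS b WHERE b.`{partition_field}` = $value"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(create_partitioned_index("idx_country", "travel", "country", ["country", "city"]))
    print(partition_scoped_query("travel", "country"))
```

A Spark job could then issue one such query per partition-key value, though the documents themselves are still fetched from multiple vBuckets.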

marcin.szymaniuk · June 28, 2018, 3:22pm

Thanks @matthew.groves

The complaints are not mine. I mentioned them because I want to learn from others’ experience and not end up in a blind alley.

I will try the Spark connector forum, as you suggested.

Thanks for the interesting link!
