Spark Connector Returns Different Result Than cbq Query

I have tested with different data sizes. If the query returns only a small result set (a few thousand rows), it works fine and the numbers are correct. For larger result sets the count no longer matches what cbq returns, and there are no error messages in the Spark log.
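For context, this is roughly how I check the connector's count against cbq. The bucket name and statement are placeholders, and sc is a SparkContext already configured for the bucket:

import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

// The same N1QL statement I run in cbq (placeholder query).
val statement = "SELECT META(t).id FROM `my-bucket` t"
// Count the rows the connector actually hands back to Spark.
val sparkCount = sc.couchbaseQuery(N1qlQuery.simple(statement)).count()
println("connector returned " + sparkCount + " rows")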

For Hive, I normally provide hive-site.xml and core-site.xml, and my Spark app is then able to access Hive tables. However, to use a Couchbase bucket, I have to add these lines of code, as stated in the Couchbase documentation:

val cfg = new SparkConf()
  .setAppName("couchbaseQuickstart") // give your app a name
  .setMaster("local[*]") // set the master to local for easy experimenting
  .set("com.couchbase.bucket.travel-sample", "") // open the travel-sample bucket

[Edit]
To be precise, this is my code. I'm using a Cloudera cluster in YARN mode, not a local master:

val sparkConf = new SparkConf()
  .setAppName(config.getString("appName"))
  .set("com.couchbase.nodes", config.getString("couchbaseSeedIPs"))
  .set("com.couchbase.bucket." + config.getString("couchbaseBucketName"), "")

hive-site.xml and core-site.xml are provided to the Spark shell as usual.
[/Edit]

As soon as these lines are added, my Spark app can no longer find Hive tables. The error message is (cannot find table "table name"). I believe it is trying to resolve the table against the Couchbase bucket instead of Hive.
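For reference, this is the shape of what I'm trying to do in a single app – read a Hive table and a Couchbase bucket side by side. All names here (nodes, bucket, database, table, document IDs) are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.spark._

val conf = new SparkConf()
  .setAppName("hivePlusCouchbase")
  .set("com.couchbase.nodes", "10.0.0.1") // placeholder seed node
  .set("com.couchbase.bucket.my-bucket", "") // placeholder bucket
val sc = new SparkContext(conf)
val hive = new HiveContext(sc)

// This lookup is what fails with "cannot find table" once the
// Couchbase settings above are present.
val hiveDf = hive.table("my_db.my_table")

// The Couchbase side works, e.g. fetching documents by key:
val docs = sc.couchbaseGet[JsonDocument](Seq("doc-id-1", "doc-id-2"))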

So overall, I believe there are three critical issues here:

  1. The Couchbase connector does not return the correct number of records when the dataset is relatively large.
  2. When a Spark app is configured to access a Couchbase bucket, it can no longer access Hive.
  3. The connector's design sends all records into one Spark executor. This, IMHO, is a serious performance issue. My test Couchbase cluster is powerful enough to support > 200K get ops/sec on my test dataset (and that number is limited only by the cluster's network bandwidth). However, when the Spark connector retrieves the same dataset (which should obviously take advantage of batch gets), the Couchbase server only sees ~1K reads/sec. And when millions of records are pulled into a single executor of a Spark cluster, users have to repartition manually (see the sketch after this list) – that's a huge performance hit and could cause other problems too.
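
For issue 3, this is the manual repartition I mean. The query result RDD arrives in a single partition, so I have to shuffle it myself before doing any heavy work (query and partition count are placeholders):

import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

val rows = sc.couchbaseQuery(N1qlQuery.simple("SELECT * FROM `my-bucket`"))
// Everything lands in one partition, i.e. on one executor.
println("partitions before: " + rows.partitions.length)
// An extra full shuffle just to get the parallelism the connector
// could have provided while fetching.
val spread = rows.repartition(64) // placeholder partition count
println("partitions after: " + spread.partitions.length)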

Hope this information is helpful. Couchbase is a great product, and a good Spark connector will bring many more big data use cases to it.

Thanks