Spark Connector Returns Different Result Than cbq Query

I have tested with different data sizes. If the query returns only a small result set (a few thousand rows), it works fine and the numbers are correct. For larger result sets the count no longer matches what cbq returns, and there are no error messages in the Spark log.
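For context, this is roughly how I check the connector's count against cbq. The bucket name and statement are placeholders, and sc is a SparkContext already configured for the bucket:

import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

// The same N1QL statement I run in cbq (placeholder query).
val statement = "SELECT META(t).id FROM `my-bucket` t"
// Count the rows the connector actually hands back to Spark.
val sparkCount = sc.couchbaseQuery(N1qlQuery.simple(statement)).count()
println("connector returned " + sparkCount + " rows")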

For Hive, I normally provide hive-site.xml and core-site.xml, and my Spark app is then able to access Hive tables. However, to use a Couchbase bucket, I have to add these lines of code, as stated in the Couchbase documentation:

val cfg = new SparkConf()
  .setAppName("couchbaseQuickstart") // give your app a name
  .setMaster("local[*]") // set the master to local for easy experimenting
  .set("com.couchbase.bucket.travel-sample", "") // open the travel-sample bucket

[Edit]
To be precise, this is my code. I'm using a Cloudera cluster in YARN mode, not a local master:

val sparkConf = new SparkConf()
  .setAppName(config.getString("appName"))
  .set("com.couchbase.nodes", config.getString("couchbaseSeedIPs"))
  .set("com.couchbase.bucket." + config.getString("couchbaseBucketName"), "")

hive-site.xml and core-site.xml are provided to the Spark shell as usual.
[/Edit]

As soon as these lines are added, my Spark app can no longer find Hive tables. The error message is (cannot find table "table name"). I believe it is trying to resolve the table against the Couchbase bucket instead of Hive.
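For reference, this is the shape of what I'm trying to do in a single app – read a Hive table and a Couchbase bucket side by side. All names here (nodes, bucket, database, table, document IDs) are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.spark._

val conf = new SparkConf()
  .setAppName("hivePlusCouchbase")
  .set("com.couchbase.nodes", "10.0.0.1") // placeholder seed node
  .set("com.couchbase.bucket.my-bucket", "") // placeholder bucket
val sc = new SparkContext(conf)
val hive = new HiveContext(sc)

// This lookup is what fails with "cannot find table" once the
// Couchbase settings above are present.
val hiveDf = hive.table("my_db.my_table")

// The Couchbase side works, e.g. fetching documents by key:
val docs = sc.couchbaseGet[JsonDocument](Seq("doc-id-1", "doc-id-2"))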

So overall, I believe there are three critical issues here:

  1. The Couchbase connector does not return the correct number of records when the dataset is relatively large.
  2. When a Spark app is configured to access a Couchbase bucket, it can no longer access Hive.
  3. The connector's design sends all records into one Spark executor. This, IMHO, is a serious performance issue. My test Couchbase cluster is powerful enough to support > 200K get ops/sec on my test dataset (and that number is limited only by the cluster's network bandwidth). However, when the Spark connector retrieves the same dataset (which should obviously take advantage of batch gets), the Couchbase server only sees ~1K reads/sec. And when millions of records are pulled into a single executor of a Spark cluster, users have to repartition manually (see the sketch after this list) – that's a huge performance hit and could cause other problems too.
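
For issue 3, this is the manual repartition I mean. The query result RDD arrives in a single partition, so I have to shuffle it myself before doing any heavy work (query and partition count are placeholders):

import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

val rows = sc.couchbaseQuery(N1qlQuery.simple("SELECT * FROM `my-bucket`"))
// Everything lands in one partition, i.e. on one executor.
println("partitions before: " + rows.partitions.length)
// An extra full shuffle just to get the parallelism the connector
// could have provided while fetching.
val spread = rows.repartition(64) // placeholder partition count
println("partitions after: " + spread.partitions.length)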

Hope this information is helpful. Couchbase is a great product, and a good Spark connector will bring many more big data use cases to it.

Thanks