Spark Connector Returns Different Result Than cbq Query

Setting `--conf spark.executor.extraJavaOptions=-Dcom.couchbase.queryTimeout=360000` also solved my problem where the number of documents actually loaded differed from the count we expected. But this is more or less a temporary solution, since the number of docs can grow dramatically and we would then have to reconfigure this parameter. Any other suggestions?
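For what it's worth, here is a minimal sketch of setting the same system property programmatically when building the SparkSession, so it doesn't have to be repeated on every spark-submit command line. The app name is a placeholder, and the 360000 ms value is just the one from above; note that in cluster mode the driver-side option may still need to be passed at submit time:

```scala
import org.apache.spark.sql.SparkSession

// -Dcom.couchbase.queryTimeout sets the Couchbase java-client's N1QL query
// timeout in milliseconds. Applying it to both driver and executors here
// avoids per-submit reconfiguration, but the value still has to be tuned
// by hand as the data volume grows.
val spark = SparkSession.builder()
  .appName("couchbase-load") // placeholder app name
  .config("spark.executor.extraJavaOptions", "-Dcom.couchbase.queryTimeout=360000")
  .config("spark.driver.extraJavaOptions", "-Dcom.couchbase.queryTimeout=360000")
  .getOrCreate()
```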

Also, it would be nice if a timeout error were thrown when a timeout happens; otherwise we really don't know whether we have the right data or not. Any ideas?

To increase parallelism, I first load all the meta ids from Couchbase and repartition them. Then I load the full documents from Couchbase. This way I can fully use my cluster's resources. You can find the code here: spark.sparkContext.couchbaseQuery number of partitions
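For reference, here is a rough sketch of that pattern as I understand it, assuming the Scala connector's `couchbaseQuery`/`couchbaseGet` API; the bucket name `travel-sample` and the partition count 64 are made-up placeholders:

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

// Step 1: fetch only META().id via N1QL -- a single, cheap query.
// `travel-sample` is a placeholder bucket name.
val ids = spark.sparkContext
  .couchbaseQuery(N1qlQuery.simple("SELECT META().id AS id FROM `travel-sample`"))
  .map(row => row.value.getString("id"))

// Step 2: repartition the ids so the follow-up fetch is spread across the
// whole cluster; 64 is a placeholder, pick something proportional to your
// total executor cores.
// Step 3: pull the full documents through the KV service in parallel.
val docs = ids
  .repartition(64)
  .couchbaseGet[JsonDocument]()

println(s"Loaded ${docs.count()} documents")
```

The appeal of this split is that the single N1QL query only ships ids, while the heavy document transfer goes through the KV service from many partitions at once, so the query-timeout setting matters much less for the bulk of the load.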