Upgrade Spark Couchbase Connector to Java Client version 3.0.0+

Hi!

Is there any ongoing effort to upgrade the Java Client within the Spark Couchbase Connector to version 3.0.0?

Update: I have just checked the complexity of upgrading to Core IO 2.0.0. It’s messy; a lot of things have changed. Upgrading to Java Client 3.0.0 (or the Scala client, for that matter) is blocked for us, since our project relies on the Couchbase Spark Connector as well. Upgrading to DCP client 0.28.0 and the Spark 3.0.0 release is easy, however, and all tests succeed locally. See [1].

[1] https://github.com/enlivensystems/couchbase-spark-connector

Thanks,
Zoltán

Hi Zoltán,

It is something we want to get to. In fact we’d like to rewrite it to use the Scala SDK that’s now available. But I have to be honest and say that it is not really on a confirmed roadmap at present.
I’d be very interested in seeing your 0.28.0 and Spark 3.0.0 changes as contributions, so we can attribute that work to you. (There’s a CONTRIBUTING.md guide in the repo.)

regards,
Graham

Hi Graham!

Thanks for the update on this issue!

I have contributed to the DCP library in the past. I personally found contributing through Gerrit cumbersome there, but I would be happy if you could pull my work from [1].

In addition to these changes, I made an attempt to shade (shadow/relocate) the Core IO and Java Client libraries in the Couchbase Spark JAR that is produced through Gradle publish. I observed that not all references to the shaded classes were correctly relocated. The effect seemed strange to me, since relocating classes is not rocket science; moreover, some references were indeed relocated while others were not, even though all of them belong to the same package. I suspect that something is not working quite right in Gradle’s shadow plugin.

Nevertheless, I converted the whole project to an SBT one and used the sbt-assembly plugin to attempt the class relocation again. There, relocation failed as well. I noticed that both sbt-assembly and Gradle’s shadow plugin use Jar Jar Links, which has limitations when rewriting imports in Scala classes, as sketched below.
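
If it helps, my understanding of the root cause is this: Jar Jar Links rewrites JVM-level bytecode references, but not the pickled Scala signature that scalac stores in the @ScalaSignature annotation and reads back when compiling downstream code. A hypothetical illustration (the class name is made up):

  // Hypothetical illustration of the Scala relocation failure mode.
  import com.couchbase.client.java.Bucket

  // Compiles fine against the original java-client.
  class BucketHolder(val bucket: Bucket)

  // After relocation with a bytecode-only tool such as Jar Jar Links, the JVM
  // descriptors inside BucketHolder.class point at
  // shaded.com.couchbase.client.java.Bucket, but the @ScalaSignature annotation
  // still records com.couchbase.client.java. Downstream Scala code touching
  // BucketHolder then fails to compile, because scalac resolves the original
  // (now absent) package from the pickled signature.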

I’ll make further attempts to shade the Core IO and Java Client libraries. That would mitigate the problem of conflicting Core IO and Java Client versions when we want to depend on newer Couchbase Java packages.
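
To make that concrete, here is a rough sketch of the end goal (placeholder connection parameters; it assumes the old client ends up relocated under a shaded prefix inside the connector JAR, so application code can depend on the new Scala SDK directly without class conflicts):

  import com.couchbase.client.scala.Cluster // new Scala SDK, unshaded

  object NewSdkAlongsideConnector {
    def main(args: Array[String]): Unit = {
      // Placeholder connection parameters.
      val cluster = Cluster.connect("127.0.0.1", "Administrator", "password").get
      val collection = cluster.bucket("travel-sample").defaultCollection
      // Fetch a document with the new SDK; the Spark connector meanwhile
      // keeps using its own relocated (shaded) copy of the old Java client.
      println(collection.get("airline_10").flatMap(_.contentAs[String]))
    }
  }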

[1] https://github.com/enlivensystems/couchbase-spark-connector

Thanks,
Zoltán

Hi Zoltán,

On the shadowing - I have some memories that problems with the shadowing plugin were the reason we had to abandon our experiment of using Gradle for couchbase-jvm-clients in the end, and return to Maven. @daschl do you recall if that was the case?

Side note - that said, I’ve also had problems with Maven’s shadowing plugin when using it with Scala. Perhaps that also comes down to this same Jar Jar Links issue? So if we were to pursue a shadowing approach, I’d be concerned about it not working once we got to rewriting the connector atop the Scala SDK. (Though the need for shadowing would also be much reduced at that point.)

I’m sorry that you find the Gerrit contribution process cumbersome. I can take a look at your changes and see if I can add them to the Spark Connector, crediting you in the notes. Out of interest, and because we are always keen to find ways of encouraging community contributions, would you submit it if we had a GitHub PR process instead?

regards,
Graham

Yeah, we ran into a couple of issues with shadowing (and renaming, assets, etc.); that’s why we stuck with Maven for now. It seems to do the job well enough, even though it’s also not perfect.

It all comes down to Jar Jar Links. Since sbt-assembly has moved to Jar Jar Abrams [2], it now fixes the shading issues around the Scala classes that are affected in the Couchbase Spark Connector as well. Gradle, as I mentioned, was a no-go. Maven did not work either.

For this reason, we converted the build to SBT. [1] It successfully shades the old libraries, and we upgraded our own clients to Scala SDK version 1.0.5. This is how we depend on it currently. (The resolver is not public.)

// Depend on the shaded classifier of the connector and exclude Spark itself,
// which the cluster already provides.
val couchbaseSpark = ("com.couchbase.client" %% "spark-connector" % "3.0.2")
    .classifier("shaded")
    .excludeAll(
      ExclusionRule("org.apache.spark")
    )

Highlights:

  • DCP client version 0.28.0.
  • Java client version 2.7.15.

Shade rules in the SBT build model of the Spark Couchbase Connector:


  assemblyShadeRules in assembly := Seq(
    // Relocate the old Java client, both inside the java-client jar itself
    // and in our own project classes that reference it.
    ShadeRule
      .rename("com.couchbase.client.java.**" -> "shaded.com.couchbase.client.java.@1")
      .inLibrary("com.couchbase.client" % "java-client" % "2.7.15")
      .inProject,
    // Relocate Core IO everywhere on the classpath.
    ShadeRule
      .rename("com.couchbase.client.core.**" -> "shaded.com.couchbase.client.core.@1")
      .inAll,
    // Relocate the field-level encryption package as well.
    ShadeRule
      .rename("com.couchbase.client.encryption.**" -> "shaded.com.couchbase.client.encryption.@1")
      .inAll
  ),
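
As a quick sanity check that the relocation really happened, one can scan the assembly output, e.g. from an sbt console (the JAR path below is just an example; adjust it to whatever your assembly task produces):

  import java.util.jar.JarFile

  // Example path from our build.
  val jar = new JarFile("target/scala-2.12/spark-connector-assembly-3.0.2.jar")
  val oldPackagesGone = jar.stream().noneMatch { e =>
    e.getName.startsWith("com/couchbase/client/core/") ||
    e.getName.startsWith("com/couchbase/client/java/")
  }
  println(s"old client packages gone from the jar: $oldPackagesGone")
  jar.close()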

Tests pass, including our own tests in all of our child projects.

“Out of interest, and because we are always keen to find ways of encouraging community contributions, would you submit it if we had a GitHub PR process instead?”

Yes, definitely.

P.S.: The new Java & Scala library design is really good.

[1] https://github.com/enlivensystems/couchbase-spark-connector
[2] https://github.com/sbt/sbt-assembly/pull/398

@zoltan.zvara very interesting, and thanks for sharing this in depth. This approach (a move to SBT, plus the shadowing) could be a promising direction for the eventual major connector upgrade. As mentioned, I don’t have a roadmap for that currently - are you happy to continue with your fork in the interim?

I wasn’t aware of the Jar Jar Links vs. Jar Jar Abrams distinction (loving the names); that certainly explains the Scala shadowing pain we’ve seen. Unfortunately, it also sounds like we won’t be able to resolve those pains in couchbase-jvm-clients with our current Maven setup anytime soon.

“P.S.: The new Java & Scala library design is really good.”

Thank you!

Hi!

One of my friends notified me about a new push to the repository. [1]

@daschl is this ready to be tested?

[1] https://github.com/couchbase/couchbase-spark-connector

@zoltan.zvara
@daschl is out of office this week.
The Spark conversion project is currently a work in progress; the work done so far is pushed because we prefer to ‘work in the open’, so the community can see where we’re leaning. If you want to check it out and provide feedback, please feel free, but it is very much pre-alpha, so of course expect plenty of issues.

We tried out the new Spark connector and noticed that it may not be well suited for Big Data applications, where a SQL query result may not fit into memory or may simply be too slow to materialize.

We also noticed that it does not utilize the Java DCP client, which should be preferred for the following reasons:

  • large SQL batch reads that would not fit into the memory of a single node could instead open ad-hoc DCP sessions to read data directly into Spark partitions,
  • the Spark Streaming (CouchbaseReceiver) option has been removed, although it could be implemented more simply with the high-level package of the DCP client (see the sketch below).
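
To sketch the second point, here is roughly how such a read could look with the high-level package of the Java DCP client (API names as I recall them from the 0.28.0 line; seed node, bucket, and credentials are placeholders):

  import com.couchbase.client.dcp.{Client, StreamFrom, StreamTo}
  import com.couchbase.client.dcp.highlevel.{DatabaseChangeListener, Deletion, FlowControlMode, Mutation, StreamFailure}

  object DcpSnapshotRead {
    def main(args: Array[String]): Unit = {
      // Placeholder connection parameters.
      val client = Client.builder()
        .seedNodes("127.0.0.1")
        .credentials("Administrator", "password")
        .bucket("travel-sample")
        .build()

      // With FlowControlMode.AUTOMATIC, backpressure acks are handled for us.
      client.listener(new DatabaseChangeListener {
        override def onMutation(mutation: Mutation): Unit =
          println(s"mutation: ${mutation.getKey}")
        override def onDeletion(deletion: Deletion): Unit =
          println(s"deletion: ${deletion.getKey}")
        override def onFailure(failure: StreamFailure): Unit =
          failure.getCause.printStackTrace()
      }, FlowControlMode.AUTOMATIC)

      client.connect().block()
      // Stream everything from the beginning up to "now"; a Spark integration
      // could instead give each partition a disjoint subset of vBucket ids.
      client.initializeState(StreamFrom.BEGINNING, StreamTo.NOW).block()
      client.startStreaming().block()
      // A real job would wait for all streams to reach their end state here.
      client.disconnect().block()
    }
  }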

Are there any plans at Couchbase regarding the above two points? If not, but they could be useful, I think we could help. We have an implementation of both that we use in production.

@zoltan.zvara we did not bring the streaming aspects to Spark 3 since they were marked as experimental and need careful reconsideration before we bring them back. I agree it is very useful, and we are thinking about how to integrate it well (similar to our Kafka or Elasticsearch connectors, which also use DCP under the covers).