On behalf of the whole team, I’m incredibly proud to announce that we’ve just released the first developer preview of our brand new Couchbase Spark Connector. It allows you to move data in and out of Couchbase directly using Spark RDDs. It turns out that Couchbase and Spark are a perfect fit because they share lots of properties like scalability, performance and ease of use.
The following features are available in this developer preview (with more to come soon):
- Creating RDDs from Documents, Views and N1QL Queries.
- Writing RDDs and DStreams into Couchbase.
- Fully transparent cluster and bucket management, including direct access if needed.
You can find the project on GitHub here. We also provide documentation in wiki form, but plan to move it to something more official once the project nears a GA release.
The developer previews are available through our own Maven repository; the GA artifacts will be published on Maven Central. Here are the coordinates (we cross-compile to Scala 2.10 and 2.11, just like Spark does):
- Group ID: com.couchbase.client
- Artifact ID: spark-connector_2.10 or spark-connector_2.11
- Version: 1.0.0-dp
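For sbt users, wiring in those coordinates might look like the following sketch. The repository URL is a placeholder here; the actual preview repository location is listed in the docs. The `%%` operator picks the matching Scala 2.10/2.11 artifact suffix automatically.

```scala
// build.sbt -- sketch; substitute the preview repository URL from the docs
resolvers += "Couchbase Preview Repository" at "<couchbase-maven-repo-url>"

libraryDependencies += "com.couchbase.client" %% "spark-connector" % "1.0.0-dp"
```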
This developer preview has been built against Spark 1.2.1. If you just want to play around with a local installation, that’s all you need to get started. Once you need to deploy to production, the easiest approach is to “fat jar” all dependencies using the great sbt-assembly plugin. Strictly speaking this doesn’t have much to do with the Couchbase connector, but we are going to provide full guides in that area soon as part of the official docs.
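As a rough sketch of the sbt-assembly setup (the plugin version shown is illustrative; check the sbt-assembly README for the current one):

```scala
// project/plugins.sbt -- enables the `sbt assembly` task that builds the fat jar
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
```

Running `sbt assembly` then produces a single jar containing your code plus the connector and its dependencies, ready to submit to the cluster.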
Once everything is on the classpath, it is time to set up a configuration and the Spark context. The full configuration is done through properties on the Spark configuration. If you do not provide any settings, the connector will connect to a server on localhost and open the default bucket.
Here is a slightly customized config that opens the beer-sample bucket on a remote cluster. We are going to use the beer-sample dataset in most of the following examples.
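A minimal sketch of such a config, assuming the `com.couchbase.*` property names from the connector wiki (the node address is made up, and the bucket property value is the bucket password, empty here):

```scala
import org.apache.spark.SparkConf

// Point the connector at a remote node and open the beer-sample bucket.
val cfg = new SparkConf()
  .setAppName("couchbaseQuickstart")
  .setMaster("local[*]")
  .set("com.couchbase.nodes", "192.168.56.101")   // address of a cluster node
  .set("com.couchbase.bucket.beer-sample", "")    // bucket name -> password
```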
It is also possible to open multiple buckets and use them at the same time; please refer to the current docs for more information. Now all we need to do is initialize the Spark context:
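This is plain Spark, assuming a `SparkConf` named `cfg` like the one described above:

```scala
import org.apache.spark.SparkContext

// The context carries the Couchbase properties set on the config.
val sc = new SparkContext(cfg)
```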
Reading from Couchbase into RDDs
The first thing to remember is to import the right namespace so all of the Couchbase-specific methods are available on the Spark context.
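That single import brings the implicit conversions into scope:

```scala
// Makes couchbaseGet, couchbaseView, saveToCouchbase, etc. available
// on SparkContext, RDDs and DStreams.
import com.couchbase.spark._
```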
Now we can use the couchbaseGet method to read a sequence of document IDs into an RDD and then optionally apply all kinds of Spark transformations to them. Here is a simple one that loads beer documents, extracts their names, and prints them:
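A sketch of that flow (the document IDs shown are sample IDs from the beer-sample dataset; substitute your own):

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.spark._

// Fetch two beer documents by ID, pull out the "name" field and print it.
sc
  .couchbaseGet[JsonDocument](Seq(
    "21st_amendment_brewery_cafe-21a_ipa",
    "21st_amendment_brewery_cafe-563_stout"))
  .map(_.content().getString("name"))
  .foreach(println)
```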
The code uses the default parallelism, but an overloaded method is available to customize that property so you can set the parallelism factor you need.
Because the SDK supports many target document types, you need to hint the one you want. For example, if you use RawJsonDocument instead of JsonDocument, you get access to the raw JSON string instead of the converted JsonObject. In a similar fashion you can even access binary data you are storing in Couchbase as documents.
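Switching the type parameter is all it takes; for instance:

```scala
import com.couchbase.client.java.document.RawJsonDocument
import com.couchbase.spark._

// Same lookup, but the content is now the raw JSON String,
// not a parsed JsonObject.
sc
  .couchbaseGet[RawJsonDocument](Seq("21st_amendment_brewery_cafe-21a_ipa"))
  .map(_.content())
  .foreach(println)
```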
You can also create an RDD from a view query. The following example prints the first 10 rows from the view result:
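A sketch using the `brewery_beers` view that ships with the beer-sample bucket:

```scala
import com.couchbase.client.java.view.ViewQuery
import com.couchbase.spark._

// Query the view, limit the result to 10 rows and print them.
sc
  .couchbaseView(ViewQuery.from("beer", "brewery_beers").limit(10))
  .foreach(println)
```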
Very often you also need to grab the full document for each emitted row. To make that happen, couchbaseGet is also available as an RDD transformation. Here is a more complete example that loads all documents from a given view and caches the RDD. We then calculate the average alcohol by volume across all beers and find the beer with the longest name.
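A sketch of that pipeline, assuming the view rows expose an `id` field as in the wiki examples (field names like `abv`, `name` and `type` come from the beer-sample documents):

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.view.ViewQuery
import com.couchbase.spark._

// Load the full document behind every view row, keep only beers, and cache.
val beers = sc
  .couchbaseView(ViewQuery.from("beer", "brewery_beers"))
  .map(_.id)
  .couchbaseGet[JsonDocument]()
  .filter(_.content().getString("type") == "beer")
  .cache()

// Average alcohol by volume across all beers (skipping missing values).
val avgAbv = beers
  .map(_.content().getDouble("abv"))
  .filter(_ != null)
  .map(_.doubleValue())
  .mean()

// The beer with the longest name.
val longestName = beers
  .map(_.content().getString("name"))
  .filter(_ != null)
  .reduce((a, b) => if (a.length >= b.length) a else b)

println(s"Average ABV: $avgAbv, longest name: $longestName")
```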
Finally, if you have a N1QL enabled server (or running at least N1QL DP4) you can run a N1QL query as well:
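A sketch using the simple query API from the underlying Java SDK (the statement itself is just an example):

```scala
import com.couchbase.client.java.query.Query
import com.couchbase.spark._

// Run a raw N1QL statement against the bucket and print each result row.
sc
  .couchbaseQuery(Query.simple(
    "SELECT name FROM `beer-sample` WHERE type = 'beer' LIMIT 10"))
  .foreach(println)
```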
Writing RDDs into Couchbase
Couchbase is well known for its excellent write performance, so it would be a shame if we couldn’t utilize that from Spark. The simplest way is to pass an RDD[Document] into the saveToCouchbase RDD method. The following code creates 100 documents in Couchbase:
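A sketch of that write path (IDs and payload are made up):

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark._

// Build 100 simple JSON documents and store them in the open bucket.
sc
  .parallelize(1 to 100)
  .map(i => JsonDocument.create("doc-" + i,
    JsonObject.create().put("value", i)))
  .saveToCouchbase()
```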
You can also make use of custom converters, which reduce the boilerplate when full control over the created document is not needed. There are some limitations at this point, but flat JSON structures can be stored this way as well (similar to the above):
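A hypothetical sketch of what that looks like; the case class and its fields are illustrative, and the exact converter mechanics are described in the wiki:

```scala
import com.couchbase.spark._

// A flat structure; the connector's converter turns it into a JSON document
// (exact implicit machinery per the current wiki docs).
case class Beer(id: String, name: String, abv: Double)

sc
  .parallelize(Seq(Beer("my-beer-1", "Test IPA", 6.5)))
  .saveToCouchbase()
```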
We are going to enhance the conversion capabilities in the next developer previews, as well as allowing you to hook in your own logic.
Finally, we also provide support for firehosing data into Couchbase from a DStream (Spark Streaming). Here is a complete example that fetches data from Twitter and stores popular hashtags as documents in Couchbase:
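A sketch of that streaming job, assuming the spark-streaming-twitter artifact is on the classpath and Twitter credentials are configured through the usual twitter4j system properties (window sizes and document IDs are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark._

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("checkpoint") // required for windowed counts

// Extract hashtags from the live tweet stream.
val hashTags = TwitterUtils.createStream(ssc, None)
  .flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

// Count each hashtag over a sliding one-minute window and persist
// the counts as JSON documents.
hashTags
  .countByValueAndWindow(Seconds(60), Seconds(10))
  .map { case (tag, count) =>
    JsonDocument.create("tag::" + tag.stripPrefix("#"),
      JsonObject.create().put("tag", tag).put("count", count))
  }
  .saveToCouchbase()

ssc.start()
ssc.awaitTermination()
```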
Plans for Developer Preview 2
As exciting as this first developer preview is, we have many more features in the queue. For example, we are working on tighter N1QL and Spark SQL integration, creating DStreams through DCP (our internal streaming protocol) to get real-time document changes into Spark, and also custom converters.
Please provide feedback and ask questions through our forums, and post any issues you find to our bug tracker (you can also file feature requests and enhancements there!). We are super excited about where this project is heading, and feedback from early adopters is crucial to make it even more awesome.