It’s been a while since we introduced the first developer preview of our brand-new Couchbase Spark Connector, so we thought it was time for another release with enhancements and a handful of new features. In particular:
- Native Support for Spark SQL
- Native Support for Spark Streaming
- Preferred Locations for Key/Value Access
- Dependency Upgrades
Most importantly, we now use Apache Spark 1.3, which will be the target version for the GA release of the connector.
If you want to dive in right now, here are the coordinates:
- Group ID: com.couchbase.client
- Artifact ID: spark-connector_2.10 or spark-connector_2.11
- Version: 1.0.0-dp2
Make sure to include the Couchbase Repository since it is a pre-release:
In addition to the plain Maven dependency, the connector is now also available on spark-packages.org!
Spark SQL Support
The main reason for upgrading to Spark 1.3 is its support for the new DataFrame API. The DataFrame API is based on SQL, which lets us integrate it tightly with N1QL, Couchbase’s own implementation of the language. The result is a seamless end-to-end experience that’s both familiar and tightly integrated.
The connector allows you to define relations, since Couchbase itself does not enforce a schema. You can either provide a schema manually or let the connector infer it for you. For automatic schema inference, it is a good idea to provide some kind of filter (for example on a type field) to get better accuracy. The connector then transparently loads sample documents and creates a schema from them, which you can query.
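To make the effect of such a filter concrete, here is a minimal sketch of schema inference over sample documents. It is plain Python and only illustrates the idea; it is not the connector's API or its actual inference algorithm, and the function name `infer_schema` and the sample documents are made up for this example.

```python
# Illustrative sketch only: not the connector's API or inference algorithm.
from collections import defaultdict


def infer_schema(documents, type_field=None, type_value=None):
    """Return {field name: sorted list of observed Python type names}."""
    observed = defaultdict(set)
    for doc in documents:
        # Optional filter, e.g. only documents whose "type" field is "airline".
        if type_field is not None and doc.get(type_field) != type_value:
            continue
        for field, value in doc.items():
            observed[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in sorted(observed.items())}


# Hypothetical sample documents from a bucket that mixes document types.
samples = [
    {"type": "airline", "name": "Example Air", "id": 1},
    {"type": "airline", "name": "Sample Jet", "id": 2},
    {"type": "airport", "name": "Example Field", "id": "A-1", "city": "Nowhere"},
]

unfiltered = infer_schema(samples)
filtered = infer_schema(samples, type_field="type", type_value="airline")

print(unfiltered)  # "id" is ambiguous here: both int and str were observed
print(filtered)    # with the filter, every field has exactly one type
```

Without the filter, documents of different kinds are mixed into one schema, so fields such as `id` end up with conflicting types and unrelated fields such as `city` leak in. Restricting the sample to one document type avoids both problems.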
Here is an example with the current travel-sample:
This first prints a schema like:
It then prints the results in this format:
While this is great to work with, the real power comes when you want to combine DataFrames from different sources. The following example fetches data from HDFS and Couchbase and queries it together:
Note that nested JSON objects and arrays are currently not supported, but will be in the next version. They can show up in the schema, but the query will fail at query time if you query one of those fields or include it in the result.
Spark Streaming Support
The first developer preview already provided support for writing data from a Spark Stream into Couchbase. In this developer preview, we’ve added experimental support for using our “DCP” protocol to feed mutations and snapshot information into Spark as quickly as possible. You can use this for near real-time analysis of data as it arrives in your Couchbase Cluster.
Note that currently snapshot and rollback support are not implemented, so failover and rebalance will not work as expected.
Here is a simple example of how to feed changes from your bucket:
That’s all you need to do at this point. Make sure to test it on an empty bucket, because currently the stream also delivers all data that is already in the bucket. Capabilities to start at the current snapshot will be added in a future release. If you run the code and then write documents into the bucket, Spark Streaming will notify you of those changes:
You can then filter for mutations and perform any kind of stream operation on them, including feeding them into a machine learning algorithm or storing them in a different datastore after processing (or even back into Couchbase).
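The filtering step can be sketched independently of Spark. The snippet below is plain Python and assumes two hypothetical message classes, `Mutation` and `Snapshot`, standing in for whatever message types the stream actually emits; the class names and fields are illustrative and not taken from the connector.

```python
# Illustrative sketch only: the message classes below are hypothetical
# stand-ins, not the connector's actual types.
from dataclasses import dataclass


@dataclass
class Mutation:
    """A document was created or changed."""
    key: str
    content: str


@dataclass
class Snapshot:
    """Snapshot information for one partition of the bucket."""
    partition: int


def mutations_only(messages):
    """Keep only the mutation messages and drop everything else."""
    return [m for m in messages if isinstance(m, Mutation)]


# A hypothetical batch as it might arrive from the change feed.
batch = [
    Snapshot(partition=0),
    Mutation(key="user::1", content='{"name": "a"}'),
    Snapshot(partition=1),
    Mutation(key="user::2", content='{"name": "b"}'),
]

changed_keys = [m.key for m in mutations_only(batch)]
print(changed_keys)
```

In a real streaming job the same predicate would be applied with the stream's own filter operation, and the surviving mutations would feed the downstream processing.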
Preferred Locations for Key/Value Access
One of the biggest performance killers when using Spark is the shuffle operation. A “shuffle” is an operation that transfers data from one worker to another over the network. Since network transfers are typically orders of magnitude slower than in-memory processing, shuffling should be avoided as much as possible.
When you are loading documents through their unique IDs, the connector now always (and transparently) hints to Spark where the document is located in Couchbase. As a result, if you deploy a Spark Worker on every Couchbase Node, Spark will be intelligent enough to dispatch the task directly to this worker, removing the need to transfer the document over the network and allowing for local processing.
Since the location is just a hint, you can still run your workers on different nodes, and Spark will dispatch the tasks properly even if it can’t find a perfect match.
In the future, we also plan to provide preferred locations for N1QL and View lookups, based on the target node where the query will be executed.
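The hint-and-fallback behavior can be illustrated with a small scheduling sketch. This is plain Python, not Spark's scheduler or the connector's code: the partition count, the `zlib.crc32`-based key hashing, and all host names are simplifying assumptions chosen for the example.

```python
# Illustrative sketch only: simplified stand-in for locality-aware dispatch.
import zlib

NUM_PARTITIONS = 8  # assumption: small value to keep the example readable


def partition_for(key):
    """Map a document ID to a partition (simplified, not Couchbase's exact hashing)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def pick_worker(key, partition_owners, worker_hosts):
    """Return (host, is_local): prefer the host that owns the document."""
    hint = partition_owners[partition_for(key)]
    if hint in worker_hosts:
        return hint, True  # perfect match: process the document where it lives
    # The location is only a hint, so fall back to any available worker.
    return sorted(worker_hosts)[0], False


# Hypothetical two-node cluster; partitions alternate between the nodes.
owners = {p: "node%d" % (p % 2 + 1) for p in range(NUM_PARTITIONS)}

colocated = pick_worker("user::1", owners, {"node1", "node2"})
separate = pick_worker("user::1", owners, {"spark9"})

print(colocated)  # local processing, no network transfer of the document
print(separate)   # still dispatched, but the document crosses the network
```

With a worker on every Couchbase node the lookup is always local; with workers elsewhere the task still runs, it just loses the locality benefit.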
Planned Features And Next Steps
While this release brings us closer to completion, there is still a lot left to do. The next milestone will be a beta release that includes the Java API, enhancements to Spark SQL and Streaming support, and of course fixes and stability improvements as we find them.
Since we haven’t reached beta yet, please provide as much feedback as possible while we can still change the API as needed.