Couchbase is a distributed NoSQL document database that runs at the edge, on-premises and in the cloud. Apache Spark is one of the most popular in-memory computing engines. Combined, the two platforms support fast queries, data engineering, data science and machine learning workloads.

In this QuickStart, I will guide you through the simple steps to set up Couchbase with Databricks* and run Couchbase data queries and Spark SQL queries.

*Note: The steps in this QuickStart have been validated against Databricks runtime 10.4 LTS.

Setup

Prerequisites

To complete this QuickStart, you will need the following:

    • A Couchbase cluster and travel-sample bucket accessible to the Databricks cluster. I used a Couchbase cluster on an AWS EC2 machine.
    • A Databricks account – free trials that require an AWS, Azure, or GCP account are available.
    • The Couchbase spark-connector library, version 3.2.2, available from Maven Central.
      • In the cluster creation screen, open the Libraries tab, select Install new, choose Maven as the library source, and search Maven Central for the package.
      • For Databricks runtime 10.4 LTS (Scala 2.12), the Maven coordinates are com.couchbase.client:spark-connector_2.12:3.2.2.


Configuration

Before we begin, we need to set the following parameters in the Spark config field under Advanced options for the Databricks cluster. This can be done when you create the cluster.

Copy and paste the settings below into the Spark config field, replacing the parameters in <> with the values for your Couchbase cluster:
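A sketch of what those settings could look like, assuming the property names used by the 3.x connector. The placeholders in <> are values you supply, and setting travel-sample as the implicit bucket is an assumption that keeps the later examples short:

```
spark.couchbase.connectionString <couchbase-host>
spark.couchbase.username <username>
spark.couchbase.password <password>
spark.couchbase.implicitBucket travel-sample
```

Each line is a key and a value separated by a space, which is the format the Databricks Spark config field expects.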

First, let’s run the necessary imports. Copy the sample code below into a blank notebook attached to a cluster with the configuration above:
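A likely set of imports for a Scala notebook, assuming connector 3.2.x package names. Not every import is needed for every step, so treat this as a starting point rather than the definitive list:

```scala
// Adds couchbaseGet, couchbaseQuery and couchbaseAnalyticsQuery to the SparkContext
import com.couchbase.spark._
// Describes a key-value fetch by document ID
import com.couchbase.spark.kv.Get
// Option names for the "couchbase.query" DataFrame source
import com.couchbase.spark.query.QueryOptions
// JSON document type from the Couchbase Scala SDK
import com.couchbase.client.scala.json.JsonObject
```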

Now, let’s get some documents by keys from the Couchbase travel-sample database using the code below:
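A minimal sketch of a key-value fetch, assuming the travel-sample bucket is configured as the implicit bucket and that the document IDs airline_10 and airline_10642 exist in it (both are illustrative choices):

```scala
import com.couchbase.spark._
import com.couchbase.spark.kv.Get
import com.couchbase.client.scala.json.JsonObject

// `sc` is the SparkContext that a Databricks notebook provides.
// couchbaseGet returns an RDD with one result per requested document ID.
val airlineDocs = sc.couchbaseGet(Seq(Get("airline_10"), Get("airline_10642")))

// contentAs returns a Try, so each printed line is a Success or a Failure.
airlineDocs
  .collect()
  .foreach(result => println(result.contentAs[JsonObject]))
```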

Great, we have connected to the cluster and returned our first RDD (Resilient Distributed Dataset).

We can also query the data using SQL++ (the Couchbase query language, based on SQL). Run the code below as an example:
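One way such a query might look, assuming the Query service is running on the cluster and a suitable index exists on the bucket. The statement itself is only an illustration:

```scala
import com.couchbase.spark._
import com.couchbase.client.scala.json.JsonObject

// Illustrative SQL++ statement against the travel-sample bucket.
val statement =
  "SELECT name, country FROM `travel-sample` WHERE type = 'airline' LIMIT 5"

// couchbaseQuery returns an RDD in which each row is a JsonObject.
sc.couchbaseQuery[JsonObject](statement)
  .collect()
  .foreach(println)
```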

Analytics Service Query

Couchbase also offers an Analytics service for operational and real-time analytics. Below is an example of an Analytics query:
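A sketch of an Analytics query from the notebook. It assumes the Analytics service is enabled and that an Analytics dataset named airports has already been created over the airport documents; that dataset name is hypothetical, so substitute your own:

```scala
import com.couchbase.spark._
import com.couchbase.client.scala.json.JsonObject

// `airports` is a hypothetical Analytics dataset over the airport documents.
val analyticsStatement =
  "SELECT a.airportname, a.city FROM airports a WHERE a.country = 'France' LIMIT 5"

// couchbaseAnalyticsQuery sends the statement to the Analytics service
// rather than the Query service.
sc.couchbaseAnalyticsQuery[JsonObject](analyticsStatement)
  .collect()
  .foreach(println)
```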

Now on to some Spark SQL

Use the code below to create temp views for airlines and airports DataFrames:
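A minimal sketch of this step, assuming travel-sample is the implicit bucket and that airline and airport documents are distinguished by their type field:

```scala
import com.couchbase.spark.query.QueryOptions

// Each DataFrame is backed by the Query service; the filter restricts
// which documents are read and used for schema inference.
val airlines = spark.read
  .format("couchbase.query")
  .option(QueryOptions.Filter, "type = 'airline'")
  .load()

val airports = spark.read
  .format("couchbase.query")
  .option(QueryOptions.Filter, "type = 'airport'")
  .load()

// Register the DataFrames as temp views so Spark SQL can reference them by name.
airlines.createOrReplaceTempView("airlines")
airports.createOrReplaceTempView("airports")
```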

We can now run Spark SQL queries on the views, for example:

Get airlines in ascending order:
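For instance, assuming a temp view named airlines exists and exposes name and country columns, the query could be written as:

```scala
// Sort airlines alphabetically by name and show the first 10 rows untruncated.
val airlinesByName = spark.sql(
  "SELECT name, country FROM airlines ORDER BY name ASC"
)
airlinesByName.show(10, false)
```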

Get airlines grouped by country:
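A sketch of the grouping, again assuming the airlines temp view and a country column; the airline_count alias is just an illustrative name:

```scala
// Count airlines per country, largest groups first.
val airlinesPerCountry = spark.sql(
  """SELECT country, COUNT(*) AS airline_count
    |FROM airlines
    |GROUP BY country
    |ORDER BY airline_count DESC""".stripMargin
)
airlinesPerCountry.show()
```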

And finally, let’s visualize the airports per country using a UDF (User Defined Function) along with the Databricks mapping feature. Create the UDF using the SQL++ below:
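The sketch below shows one possible UDF, under several assumptions: the function name country_code is hypothetical, the cluster runs Couchbase Server 7.0 or later (for inline functions with OR REPLACE), and the map visualization expects ISO 3166-1 alpha-3 country codes. The statement is held in a Scala string so it can be pasted into the Couchbase Query Workbench:

```scala
// Hypothetical inline SQL++ function that maps the country names found in
// travel-sample to ISO 3166-1 alpha-3 codes; unknown names pass through unchanged.
val createCountryCodeFunction =
  """CREATE OR REPLACE FUNCTION country_code(country) LANGUAGE INLINE AS
    |(CASE country
    |   WHEN 'United States' THEN 'USA'
    |   WHEN 'United Kingdom' THEN 'GBR'
    |   WHEN 'France' THEN 'FRA'
    |   ELSE country
    | END)""".stripMargin

// Print the statement so it can be copied into the Query Workbench and run there.
println(createCountryCodeFunction)
```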

Select the airport counts by country and visualize the results:
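A sketch of this step, assuming the UDF created in the previous step is named country_code (a hypothetical name) and is callable from the Query service. The result set is small, so it is collected to the driver and converted to a DataFrame for display:

```scala
import com.couchbase.spark._
import com.couchbase.client.scala.json.JsonObject
import spark.implicits._

// Count airports per country, translating the country name through the UDF.
val airportCountStatement =
  """SELECT country_code(country) AS country, COUNT(*) AS airports
    |FROM `travel-sample`
    |WHERE type = 'airport'
    |GROUP BY country""".stripMargin

// Pull the few result rows to the driver and extract typed fields from each JsonObject.
val airportCounts = sc.couchbaseQuery[JsonObject](airportCountStatement)
  .collect()
  .map(row => (row.str("country"), row.numLong("airports")))
  .toSeq
  .toDF("country", "airports")

// display is the Databricks notebook helper; choose the Map plot type in the output cell.
display(airportCounts)
```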

After completing this QuickStart, your result should be similar to the visualization below:

What we have accomplished

In this QuickStart, I have outlined how to use the Couchbase spark-connector with Databricks to create RDDs, run Couchbase and Spark SQL queries, create a UDF, and use the Databricks mapping feature to visualize the results. These steps demonstrate how to access, analyze and visualize data in a Couchbase cluster from a Databricks notebook.

Next steps

Learn more about Couchbase Capella:

Thank you for reading this post! If you have any questions or comments, please connect with us on the Couchbase Forums!

Author

Posted by Rick Jacobs

Rick Jacobs is the Technical Product Marketing Manager at Couchbase. His varied background includes experience at many of the world’s leading organizations such as Computer Sciences Corporation, IBM, Cloudera etc. He comes with over 15 years of general technology experience garnered from serving in development, consulting, data science, sales engineering and technical marketing roles. He holds several academic degrees including an MS in Computational Science from George Mason University.
