Couchbase Spark Connector 1.0.0 Released

Michael Nitschinger

October 26, 2015

6 min read

Spark Connector 1.0.0 Released

After two developer previews and one beta, I’m super happy to announce the first stable release of our Couchbase Spark Connector. The timing is no coincidence, since Spark Summit Europe 2015 is happening next week in Amsterdam. We are sponsoring the event, so you can find me and my colleagues there at the Couchbase booth!

This stable release marks the end of larger breaking changes, bringing stability into the API and a clear path going forward. If you haven’t read the previous announcements, the following post provides a whirlwind tour of the features and capabilities.

The Connector is distributed from Maven Central (as well as spark-packages.org), so if you want to experiment with it using the spark-shell, this is all you need to get up and running:
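The snippet that originally followed is missing from this copy of the post. As a rough sketch, a `build.sbt` for a standalone project could look like the fragment below. The connector version 1.0.0 comes from this announcement; the Maven group and artifact name, the Scala version and the Spark version are assumptions and should be checked against Maven Central. For the spark-shell, the same coordinates (assumed to be `com.couchbase.client:spark-connector_2.10:1.0.0`) would be passed through the `--packages` flag.

```scala
// build.sbt (sketch): everything except the connector version 1.0.0 is an assumption
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Spark itself is usually supplied by the cluster at runtime
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided",
  // The Couchbase Spark Connector announced in this post
  "com.couchbase.client" %% "spark-connector" % "1.0.0"
)
```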

To whet your appetite, here is a full code sample you can execute against our “travel-sample” dataset. It uses Spark SQL to create a data frame for all airlines (based on a predicate you specify) and then selects some fields and applies ordering as well as a limit:
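The sample itself did not survive in this copy of the post. The version below is reconstructed from the snippet a reader quotes in the comment at the end of this page; the import lines and the surrounding `object` are assumptions added to make it self-contained, and it needs a running Couchbase cluster with the “travel-sample” bucket loaded.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.EqualTo
import com.couchbase.spark.sql._ // assumed import: adds .couchbase(...) to sql.read

object AirlinesExample {
  def main(args: Array[String]): Unit = {
    // Generic Spark context, pointed at the "travel-sample" bucket (empty password)
    val sc = new SparkContext(new SparkConf()
      .setAppName("example")
      .setMaster("local[*]")
      .set("com.couchbase.bucket.travel-sample", ""))

    // Set up Spark SQL
    val sql = new SQLContext(sc)

    // Create a DataFrame with schema inference, restricted to airline documents
    val airlines = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))

    // Select a few fields, order by name and keep the first five rows
    airlines
      .select("name", "iata", "icao")
      .sort(airlines("name").asc)
      .limit(5)
      .show()

    sc.stop()
  }
}
```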

This prints:

In a few lines of code you can run all kinds of queries for data analysis, ETL or machine learning on top of Couchbase. To me that’s pretty awesome – if you like it too, read on for all the details.

By the way, the full documentation can be found here.

Spark Core – The Scalable Foundation

The lowest-level user-facing API in Spark is the RDD (Resilient Distributed Dataset). It is basically a collection of data which Spark distributes across the cluster. Since Spark is a big data crunching machine but not a database, it needs mechanisms to create RDDs as well as to persist them at the end of a computation. To assist with this, Couchbase provides:

API to create RDDs through KeyValue, Views and N1QL
Persist RDDs into a Couchbase Bucket through KeyValue

The detailed documentation for those tasks is available here. The following code samples show how to create RDDs as well as persist them. Note that these samples just expect a SparkContext to be available.
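The original samples are not preserved here, so the following is only a sketch of what KeyValue-based creation and persistence could look like. It assumes that importing `com.couchbase.spark._` adds `couchbaseGet` to the SparkContext and `saveToCouchbase` to RDDs of documents; the document IDs and the `hello`/`world` content are purely illustrative.

```scala
import org.apache.spark.SparkContext
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark._ // assumed import: adds couchbaseGet / saveToCouchbase

object KeyValueRdds {
  // Create an RDD by fetching documents by ID through KeyValue and print them.
  def printAirlines(sc: SparkContext): Unit =
    sc.couchbaseGet[JsonDocument](Seq("airline_10123", "airline_10748"))
      .collect()
      .foreach(println)

  // Turn plain IDs into documents and persist the RDD through KeyValue.
  def storeGreetings(sc: SparkContext): Unit =
    sc.parallelize(Seq("doc1", "doc2"))
      .map(id => JsonDocument.create(id, JsonObject.create().put("hello", "world")))
      .saveToCouchbase()
}
```

Both methods take the SparkContext as a parameter, so they can be called from the spark-shell with the shared `sc`.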

And here is a more complicated example, which reads all airlines, performs a classic word count on their names, aggregates the results and stores them in a document back in the Couchbase cluster:
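The original listing is missing from this copy. The sketch below covers only the counting step and uses plain Scala collections as a stand-in for the RDD; on a real RDD the same idea is usually written with `flatMap`, `map` and `reduceByKey`. Reading the airline names and storing the result would go through the connector, which is not shown. The object and method names are hypothetical.

```scala
object WordCount {
  // Split each airline name into words, normalise the case,
  // then count how often each word occurs.
  def count(names: Seq[String]): Map[String, Int] =
    names
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.size }
}
```

For example, `WordCount.count(Seq("Air France", "Air Berlin"))` counts `air` twice and `france` and `berlin` once each.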

As you can imagine, lots of things are going on behind the scenes. The API calls are turned into Couchbase queries, but more importantly the connector handles resources completely transparently. Since your computations will be executed on arbitrary workers in the cluster, the connector opens connections where needed in an efficient fashion. So you just need to tell Spark what to fetch or persist – the connector will handle the rest.

If you run Spark workers side by side with Couchbase nodes, the connector tries to hint the proper worker for KeyValue operations (again, transparently). That way expensive network shuffle operations are reduced, leading to even better performance in such setups. Note that this is a pure optimization: you can run any topology you like and it will just work.

Spark SQL – A N1QL Lovestory

Spark SQL is a module for working with structured data. It allows the user to put a schema over an RDD, which is then called a DataFrame (previously SchemaRDD). Because Spark now has structural information about the data it is working with, it can apply all kinds of transformations and optimizations.

Couchbase Server 4.0 includes the brand new N1QL query language, which blends perfectly into the Spark SQL APIs. There is only one gotcha: documents stored in Couchbase are not required to adhere to a specific schema – that’s one of its features. So how do we bring structure into a schemaless world?

The answer to that is automatic schema inference. If you create a DataFrame on top of Couchbase, you need to provide a “schemaFilter” which in turn will internally create a predicate. Then we will load lots of documents with that predicate and infer the schema from there. The following example shows how to create a DataFrame for airlines in the “travel-sample” bucket, which are identified by their type attribute in the document itself:
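The example is missing from this copy, so here is a sketch based on the `schemaFilter` call quoted in the reader comment at the end of this page. The import of `com.couchbase.spark.sql._` and the wrapper object are assumptions; `printSchema()` is the standard Spark call for showing what was inferred.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.sources.EqualTo
import com.couchbase.spark.sql._ // assumed import: adds .couchbase(...) to sql.read

object InferredSchema {
  // Build a DataFrame over all documents whose "type" field equals "airline"
  // and print the schema the connector inferred for them.
  def airlines(sql: SQLContext): DataFrame = {
    val df = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
    df.printSchema()
    df
  }
}
```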

This prints:

If your documents are more or less similar, this approach works well. If your documents are completely schemaless so that every document looks very different, you can also provide the schema manually. This way, you specify only the fields you potentially need:

Finally, if this still doesn’t work you can always fall back to an RDD query and generate a DataFrame from the results:
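The original listing is not preserved. As a sketch of the general idea, any RDD of JSON strings can be handed to Spark's `sql.read.json`, which infers a schema from the data. The version below fetches two documents by ID through KeyValue instead of running a query, assumes `com.couchbase.spark._` provides `couchbaseGet`, and uses illustrative document IDs.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.spark._ // assumed import: adds couchbaseGet

object RddToDataFrame {
  def airlinesFromRdd(sc: SparkContext, sql: SQLContext): DataFrame = {
    // Fetch documents and keep only their JSON content as strings
    val json = sc
      .couchbaseGet[JsonDocument](Seq("airline_10123", "airline_10748"))
      .map(_.content().toString)

    // Let Spark SQL infer the schema from the JSON strings
    val df = sql.read.json(json)
    df.printSchema()
    df
  }
}
```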

This prints:

You can see how it even detects the recursive structure of JSON objects and arrays. This can be utilized as well at query time, giving you flexibility in both data modeling and querying.
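To make the recursive part concrete, here is a toy sketch in plain Scala that describes the structure of one sampled document. It is not the connector's or Spark's inference code: the names are hypothetical, documents are modelled as nested `Map` and `Seq` values, and arrays are described by their first element only.

```scala
object SchemaSketch {
  // Describe the type of one sampled JSON value, recursing into
  // nested objects and arrays.
  def describe(value: Any): String = value match {
    case _: String            => "string"
    case _: Boolean           => "boolean"
    case _: Int | _: Long     => "long"
    case _: Float | _: Double => "double"
    case obj: Map[_, _] =>
      obj.toSeq
        .map(kv => s"${kv._1}: ${describe(kv._2)}")
        .sorted
        .mkString("struct<", ", ", ">")
    case arr: Seq[_] =>
      "array<" + arr.headOption.map(x => describe(x)).getOrElse("string") + ">"
    case _ => "string" // null or unknown values fall back to string
  }
}
```

A document with a nested `geo` object and a `codes` array would be described as a `struct` containing another `struct` and an `array<string>`.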

Now that you have your DataFrame created, you can perform all kinds of queries against it:
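The original query is missing here. The sketch below uses only standard Spark DataFrame operations; it assumes `airlines` is a DataFrame created from Couchbase through a schema filter and that the documents carry `country`, `name` and `callsign` fields.

```scala
import org.apache.spark.sql.DataFrame

object AirlineQueries {
  // Filter on one field, project two others and show the first ten rows,
  // ordered by callsign in descending order.
  def usAirlines(airlines: DataFrame): Unit =
    airlines
      .filter(airlines("country") === "United States")
      .select("name", "callsign")
      .sort(airlines("callsign").desc)
      .show(10)
}
```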

This prints:

Here is a different example which shows how you can create a DataFrame from HDFS and join it with Couchbase rows:
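The original example is not preserved, so this is only a sketch of the pattern. The ratings file, its `iata` and `rating` fields and the HDFS path are hypothetical; the Couchbase side reuses the `schemaFilter` call quoted in the reader comment at the end of this page, and the import of `com.couchbase.spark.sql._` is an assumption.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.sources.EqualTo
import com.couchbase.spark.sql._ // assumed import: adds .couchbase(...) to sql.read

object HdfsJoin {
  // ratingsPath is a hypothetical JSON file on HDFS,
  // for example "hdfs://namenode:8020/data/ratings.json"
  def airlinesWithRatings(sql: SQLContext, ratingsPath: String): DataFrame = {
    val airlines = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
    val ratings  = sql.read.json(ratingsPath)

    // Join both sources on the IATA code and keep one column from each side
    airlines
      .join(ratings, airlines("iata") === ratings("iata"))
      .select(airlines("name"), ratings("rating"))
  }
}
```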

One important piece of this is handled under the covers as well: the required fields and predicates are pushed down to the N1QL query engine on the server, so we only compute and transfer essential data, allowing for more efficient networking and CPU resource handling.
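To illustrate what pushing down means, the toy sketch below turns a list of required fields and equality predicates into a N1QL statement. It is not the connector's actual translation code: `EqualTo` here is a simplified, hypothetical stand-in for Spark's source filter, and the quoting is deliberately naive.

```scala
object Pushdown {
  // Hypothetical, simplified stand-in for a Spark SQL source filter.
  final case class EqualTo(field: String, value: Any)

  // Naive literal rendering without escaping; illustration only.
  private def literal(value: Any): String = value match {
    case s: String => s"'$s'"
    case other     => other.toString
  }

  // Build a N1QL statement that selects only the required fields and
  // applies every predicate on the server side.
  def toN1ql(bucket: String, fields: Seq[String], filters: Seq[EqualTo]): String = {
    val projection =
      if (fields.isEmpty) "*" else fields.map(f => s"`$f`").mkString(", ")
    val where =
      if (filters.isEmpty) ""
      else filters
        .map(f => s"`${f.field}` = ${literal(f.value)}")
        .mkString(" WHERE ", " AND ", "")
    s"SELECT $projection FROM `$bucket`$where"
  }
}
```

Selecting `name` and `iata` for airline documents would then send one narrow statement to the server instead of fetching whole documents and filtering in Spark.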

Spark Streaming – In-N-Out in (soft) Realtime

Spark Streaming brings a microbatch streaming approach to Spark, allowing you to run both batch and streaming applications in one system. The connector allows you to persist such streams into Couchbase as well as (experimentally) create such a stream through Couchbase’s internal Database Change Protocol (DCP).

Persisting a DStream works the same way as persisting an RDD – you just need to use the right implicit import and convert it into a Document representation. The following example shows how to persist the content of tweets from a Twitter feed into Couchbase:
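The original listing is missing from this copy. The sketch below leaves out the Twitter receiver and starts from a hypothetical stream of (tweet ID, tweet text) pairs. It assumes that importing `com.couchbase.spark.streaming._` adds `saveToCouchbase` to DStreams of documents; the `tweet::` key prefix is illustrative.

```scala
import org.apache.spark.streaming.dstream.DStream
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark.streaming._ // assumed import: adds saveToCouchbase to DStreams

object TweetSink {
  // `tweets` is a hypothetical stream of (tweet id, tweet text) pairs,
  // for example derived from a Twitter receiver.
  def persist(tweets: DStream[(Long, String)]): Unit =
    tweets
      .map { case (id, text) =>
        // Convert each tweet into a Document representation
        JsonDocument.create(s"tweet::$id", JsonObject.create().put("text", text))
      }
      .saveToCouchbase()
}
```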

You can find more information about Spark Streaming support here.

The Road Ahead

Getting this first stable release out of the door was important. The next release (1.1) will bring official compatibility with Spark 1.5, as well as other enhancements and stability fixes. As always, please try out the connector and provide feedback on what you think we should improve.

Happy hacking, no bugs and quick shuffle operations!

Author

Posted by: Michael Nitschinger

Michael Nitschinger works as a Principal Software Engineer at Couchbase. He is the architect and maintainer of the Couchbase Java SDK, one of the first completely reactive database drivers on the JVM. He also authored and maintains the Couchbase Spark Connector. Michael is active in the open source community, a contributor to various other projects like RxJava and Netty.

1 response

Mark

July 28, 2016, 14:13

Hi.

How would this code look in Databricks? If you run it as is, there is an error: developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()

Code I am referring to:

// Generate The Generic Spark Context
val sc = new SparkContext(new SparkConf().setAppName("example")
  .setMaster("local[*]")
  .set("com.couchbase.bucket.travel-sample", ""))

// Setup Spark SQL
val sql = new SQLContext(sc)

// Create a DataFrame with Schema Inference
val airlines = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))

// Perform the query
airlines
  .select("name", "iata", "icao")
  .sort(airlines("name").asc)
  .limit(5)
  .show()

Thanks,

Mark
