As NoSQL databases evolved, each added higher level APIs or languages to help programmers to complex things easily. SQL, having done that for relational data, showed the way. In SQL, developers say WHAT needs to be done and the database engine figures out the HOW.  HOW is the efficient procedure/algorithm to execute the statement. Select, join and project are the basic operations SQL processing. Even in NoSQL systems when you model data without much normalization, you still need to join a collection of objects.  Customers with orders, orders with inventory, inventory with suppliers, suppliers with credits and so forth. Hence, Couchbase N1QL has supported join operations since its first release.  Following that, MongoDB, in version 3.2, added $lookup operator to the aggregation framework to perform the join operations.

Without an expressive and high-performance query feature, application developers have to do it within the application or export the data to a system that does it.  Both expensive propositions.

In this article, we will be comparing Couchbase vs. MongoDB and their differing approaches to join JSON documents. Specifically, we do a comparative study of using joins for MongoDB collections versus how we can execute them in Couchbase. Joins are unsupported in Cassandra CQL and DynamoDB natively.  Developers must do the work themselves or use other layers like Spark or Amazon EMR to achieve the same result. So, we won’t cover them in this article.

Joins in Couchbase

Couchbase introduced INNER and LEFT OUTER joins starting with Couchbase 4.0 (2015). This supported joins of in a child-to-parent relationship.  Children documents (e.g. Orders) can be joined with parent documents (e.g. customer). In 4.5 (2016), Couchbase introduced index joins to query from parent to child joins.  In both cases, there was an implied attribute-value to document-key equality predicate, specified by the ON KEY clause.  

Couchbase 5.5 has ANSI standard SQL extended for JSON.  It supports INNER JOIN, LEFT OUTER JOIN and limited RIGHT OUTER join.  We’ll be using examples based on Couchbase 5.5.

Couchbase joins documentation: https://developer.couchbase.com/documentation/server/5.5/n1ql/n1ql-language-reference/from.html

MongoDB Joins and Collections:

Joins are supported via the $lookup operator within the aggregation framework.

Following are excerpts from the MongoDB documentation.

New in version 3.2.

Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. To each input document, the $lookup stage adds a new array field whose elements are the matching documents from the “joined” collection. The $lookup stage passes these reshaped documents to the next stage.

Eliot Horowitz, MongoDB CTO, said: “MongoDB aggregation is similar to Unix pipeline. The output of one stage goes into another….[it’s] very procedural. Lets you think about in a very procedural way.”

MongoDB $lookup : https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/


Interested in learning more on JOINs?  Read the article by Lukas Eder. https://dzone.com/articles/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql


High Level Comparison between Couchbase N1Ql and MongoDB.

Couchbase N1QL: Supports INNER JOIN, LEFT OUTER JOIN and limited RIGHT OUTER JOIN.  The query language, like SQL, is declarative. Developers write, tools generate the query to the N1QL syntax.  The engine figures out the plan and executes the query.

MongoDB: Supports LEFT OUTER JOIN for scalar values only.  The design of the joins into the MongoDB query language is done help write the query and process data in a procedural way. 

Implication:

  1. Left outer join resultset is a superset of inner join resultset.  It’s possible to add additional predicates to eliminate the non-matching (null-projected or missing subservient side of the join) documents after the left outer join is performed.  That’s like going from San Francisco to Chicago via London. You can do it, but it’s expensive. For the query execution, it takes time, memory, cpu resources affecting the overall performance of the system.
  2. N1QL support for joins is declarative.  MongoDB language is somewhat procedural.  You’ve to separate the predicates, think about the join order between collections, think about when to group, sort, etc.  Writing queries with MongoDB aggregation is like writing query plans, step by step.

Examples:

We use the simple travel-sample model and data.  Here are the details of the model data. https://developer.couchbase.com/documentation/server/4.5/travel-app/travel-app-data-model.html

We simply exported the data from Couchbase and imported it to a mongo database called travel-sample.  In MongoDB, the 5 different types of the document (landmark, route, airline, airport, hotel) are stored in 5 collections with respective names.

Example 1: LEFT OUTER JOIN with ON clause on scalar values.

Couchbase N1QL

MongoDB Query

Observations:

This is a fairly simple left outer join query joining two collections and then simply counting the total number of documents produced.  Notice, unlike N1QL (and SQL), in MongoDB, you’d still have to group the resultset to get the count, even if you have a single group.

Example 2: List the airports and landmarks in the same city, ordered by the airports.

Couchbase N1QL:

MongoDB:

Observations:

  1. This query uses INNER JOIN which MongoDB does not have.  So, in MongoDB, you first do the lookup join to get the LEFT OUTER JOIN, and then eliminate non matching, but projected documents (because of the left outer) using the match stage (code: $match: {“airline_docs”: {$ne: []}}).
  2. Then, you’ve to remember the matched documents are in an array data structure, unwind them before you group them by the landmark.name.   Then do the sort and final projection.

As expected, the MongoDB join query is procedural and you’ve to understand the execution plan and write code for each stage.

Example 3: Starting from San Francisco, find all the destination airports (those have routes from SFO).

Couchbase N1QL

MongoDB:

Observations:

  1. The join clause for this query is a bit more complex, with two predicates (airport.faa = (route.sourceairport AND route.type = “route”).  This requires a cumbersome pipeline syntax on the MongoDB query.
  2. And because you need to differentiate between the two collections, you need another let stage to create the local variables for airport attributes.
  3. Like before, it requires an additional match clause to eliminate non-matching (empty) airline docs, followed by grouping and sorting.
  4. As you can see visually, the MongoDB query is getting larger and larger to do the same job as Couchbase N1QL.

Example 4: Find all the hotels and landmarks in Yosemite.  Hotels should have altleast 5 likes.

Couchbase N1QL

MongoDB:

Observation:

  1. Translating the LIKE predicate into a regular expression was straightforward, but determining if there were atleast five public_likes was not.  Needed additional projection and matching phase to calculate the size of the public_likes at the end.
  2. When you have many attributes to match, manipulate and project, you’d have to rename them properly at appropriate stage otherwise, the query can’t reference it.  For example, hotel.name had to be renamed to hname before the unwind.  Maybe there’s a better way to write this stage!
  3. N1QL expressed the query in 370 characters. MongoDB required 956 characters.  All this for a two table join.  As the complexity increases, the ratio increases as well since the MongoDB query is written in a procedural way.

Example 5: Find all the hotels and landmarks in Yosemite.  Hotels should have at least 5 likes.

This is just like Example 4, but do it faster!

Couchbase N1QL

Observation:

The default join method in Couchbase N1QL is nested loop join.  This works fine when you have a smaller number of documents involved on each side of the join.  When you have a larger data set, typically in reporting queries, nested loop join slows down.  Couchbase N1QL has hash joins and this speeds up joins significantly.  When each side of join has thousands of documents to millions of documents, the speed increase can be 2x to 20x or more. See the detailed Couchbase blog on ANSI Joins for more information.

From the documentation and explain plan, it’s unclear what join method MongoDB uses. Some of the blogs indicate that they’ve used a nested loop join to implement the $lookup operator.

Summary:

 

Couchbase N1QL MongoDB
JOIN approach Declarative, like SQL.

Allows joining between any sized and distributed data set.

Procedural with some declarative aspects (e.g. index selection).

Can only join a sharded collection in an unsharded collection. To join two sharded collections, applications will have to write the join algorithm.

JOINs supported LEFT OUTER JOIN

INNER JOIN

RIGHT OUTER JOIN

$lookup implements the LEFT OUTER JOIN on scalar values.
ON-clause support Full expressions.

Scalars

Arrays

Implicit equality

Pipeline expression

Arrays should to be $unwind before the $lookup

JOIN implementation Block Nested Loop

Hash join with user defined build and probes.

Nested Loop
ON Clause ON clause with any expression. $pipeline expression
Array expressions in ON clause Use ANY, IN expressions.

Supports UNNEST

Pipeline with $unwind before $match
Explain Visual explain and

JSON explain

Visual explain and

JSON explain

JOIN order Left to right, as specified by the user. The optimizer is rule based. As specified in the pipeline.
Nested JOINs Supported via derived tables.

FROM clause can have subselects which can have joins or subselects in turn.

No
JOIN predicate processing Optimizer processes the join predicates, constant predicates and pushes the predicates to the index automatically. Manual design of predicates for each collection, careful ordering of pipeline stages without full help from the optimizer.

How about performance?  Good question.  That’s for a future blog!

And now, a quote:

“A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts.”

— William Strunk, Jr.  Elements of Style.

References:

  1. Couchbase Documentation: https://docs.couchbase.com
  2. MongoDB Documentation: https://docs.mongodb.com/
  3. ANSI Joins in Couchbase N1QL: https://www.couchbase.com/blog/ansi-join-support-n1ql/
  4. N1QL Tutorial: https://query-tutorial.couchbase.com/tutorial/#1

Author

Posted by Keshav Murthy

Keshav Murthy is a Vice President at Couchbase R&D. Previously, he was at MapR, IBM, Informix, Sybase, with more than 20 years of experience in database design & development. He lead the SQL and NoSQL R&D team at IBM Informix. He has received two President's Club awards at Couchbase, two Outstanding Technical Achievement Awards at IBM. Keshav has a bachelor's degree in Computer Science and Engineering from the University of Mysore, India, holds ten US patents and has three US patents pending.

Leave a reply