How to Join JSON: Couchbase N1QL vs. MongoDB Query

As NoSQL databases evolved, each added higher level APIs or languages to help programmers to complex things easily. SQL, having done that for relational data, showed the way. In SQL, developers say WHAT needs to be done and the database engine figures out the HOW. HOW is the efficient procedure/algorithm to execute the statement. Select, join and project are the basic operations SQL processing. Even in NoSQL systems when you model data without much normalization, you still need to join a collection of objects. Customers with orders, orders with inventory, inventory with suppliers, suppliers with credits and so forth. Hence, Couchbase N1QL has supported join operations since its first release. Following that, MongoDB, in version 3.2, added $lookup operator to the aggregation framework to perform the join operations.

Without an expressive and high-performance query feature, application developers have to do it within the application or export the data to a system that does it. Both expensive propositions.

In this article, we will be comparing Couchbase vs. MongoDB and their differing approaches to join JSON documents. Specifically, we do a comparative study of using joins for MongoDB collections versus how we can execute them in Couchbase. Joins are unsupported in Cassandra CQL and DynamoDB natively. Developers must do the work themselves or use other layers like Spark or Amazon EMR to achieve the same result. So, we won’t cover them in this article.

Joins in Couchbase

Couchbase introduced INNER and LEFT OUTER joins starting with Couchbase 4.0 (2015). This supported joins of in a child-to-parent relationship. Children documents (e.g. Orders) can be joined with parent documents (e.g. customer). In 4.5 (2016), Couchbase introduced index joins to query from parent to child joins. In both cases, there was an implied attribute-value to document-key equality predicate, specified by the ON KEY clause.

Couchbase 5.5 has ANSI standard SQL extended for JSON. It supports INNER JOIN, LEFT OUTER JOIN and limited RIGHT OUTER join. We’ll be using examples based on Couchbase 5.5.

Couchbase joins documentation: https://developer.couchbase.com/documentation/server/5.5/n1ql/n1ql-language-reference/from.html

MongoDB Joins and Collections:

Joins are supported via the $lookup operator within the aggregation framework.

Following are excerpts from the MongoDB documentation.

New in version 3.2.

Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. To each input document, the $lookup stage adds a new array field whose elements are the matching documents from the “joined” collection. The $lookup stage passes these reshaped documents to the next stage.

Eliot Horowitz, MongoDB CTO, said: “MongoDB aggregation is similar to Unix pipeline. The output of one stage goes into another….[it’s] very procedural. Lets you think about in a very procedural way.”

MongoDB $lookup : https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/

Interested in learning more on JOINs? Read the article by Lukas Eder. https://dzone.com/articles/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql

High Level Comparison between Couchbase N1Ql and MongoDB.

Couchbase N1QL: Supports INNER JOIN, LEFT OUTER JOIN and limited RIGHT OUTER JOIN. The query language, like SQL, is declarative. Developers write, tools generate the query to the N1QL syntax. The engine figures out the plan and executes the query.

MongoDB: Supports LEFT OUTER JOIN for scalar values only. The design of the joins into the MongoDB query language is done help write the query and process data in a procedural way.

Implication:

Left outer join resultset is a superset of inner join resultset. It’s possible to add additional predicates to eliminate the non-matching (null-projected or missing subservient side of the join) documents after the left outer join is performed. That’s like going from San Francisco to Chicago via London. You can do it, but it’s expensive. For the query execution, it takes time, memory, cpu resources affecting the overall performance of the system.
N1QL support for joins is declarative. MongoDB language is somewhat procedural. You’ve to separate the predicates, think about the join order between collections, think about when to group, sort, etc. Writing queries with MongoDB aggregation is like writing query plans, step by step.

Examples:

We use the simple travel-sample model and data. Here are the details of the model data. https://developer.couchbase.com/documentation/server/4.5/travel-app/travel-app-data-model.html

We simply exported the data from Couchbase and imported it to a mongo database called travel-sample. In MongoDB, the 5 different types of the document (landmark, route, airline, airport, hotel) are stored in 5 collections with respective names.

Example 1: LEFT OUTER JOIN with ON clause on scalar values.

Couchbase N1QL


SELECT count(*)
FROM `travel-sample` route
         LEFT OUTER JOIN `travel-sample` airline
         ON (route.airlineid = META(airline).id)
WHERE route.type = ‘route’;

SELECT count(*)

FROM `travel–sample` route

LEFT OUTER JOIN `travel–sample` airline

ON (route.airlineid = META(airline).id)

WHERE route.type = ‘route’;

MongoDB Query


db.route.aggregate([
   {
     $lookup:
       {
         from:”airline”,
         localField: “airlineid”,
         foreignField: “_id”,
         as: “airline_docs”
       }
   },
   { $group: { _id: null, myCount: { $sum: 1 } } },
   { $project: { _id: 0 } }
 ]);

db.route.aggregate([

{

$lookup:

{

from:“airline”,

localField: “airlineid”,

foreignField: “_id”,

as: “airline_docs”

}

{ $group: { _id: null, myCount: { $sum: 1 } } },

{ $project: { _id: 0 } }

]);

Observations:

This is a fairly simple left outer join query joining two collections and then simply counting the total number of documents produced. Notice, unlike N1QL (and SQL), in MongoDB, you’d still have to group the resultset to get the count, even if you have a single group.

Example 2: List the airports and landmarks in the same city, ordered by the airports.

Couchbase N1QL:


SELECT landmark.name AS Landmark_Name,
       MIN(airport.airportname) AS Airport_Name,
       MIN(airport.tz) AS Landmark_Time
FROM `travel-sample` airport INNER JOIN `travel-sample` landmark
      ON airport.city = landmark.city
WHERE landmark.country = “United States”
      AND airport.type = “airport”
      AND landmark.type = “landmark”
GROUP BY landmark.name
ORDER BY Airport_Name

SELECT landmark.name AS Landmark_Name,

MIN(airport.airportname) AS Airport_Name,

MIN(airport.tz) AS Landmark_Time

FROM `travel–sample` airport INNER JOIN `travel–sample` landmark

ON airport.city = landmark.city

WHERE landmark.country = “United States”

AND airport.type = “airport”

AND landmark.type = “landmark”

GROUP BY landmark.name

ORDER BY Airport_Name

MongoDB:


db.airport.aggregate([
   {
     $lookup:
       {
         from:”landmark”,
         localField: “city”,
         foreignField: “city”,
         as: “aplm_docs”
       }
   },
   {
      $match: {“airline_docs”: {$ne: []}}
   },
   {  $unwind: { path: “$aplm_docs”, preserveNullAndEmptyArrays: true }},
   {  $group: {
      _id: “$aplm_docs.name”,
      Airport_Name: { $min: “$airportname” } ,
      Landmark_Time: { $min: “$tz”}
     }
   },
   { $sort : { Airport_Name: 1 } },
   { $project: { _id: 1, Airport_Name:1, Landmark_Time:1  } }
 ]);

db.airport.aggregate([

{

$lookup:

{

from:“landmark”,

localField: “city”,

foreignField: “city”,

as: “aplm_docs”

}

{

$match: {“airline_docs”: {$ne: []}}

{ $unwind: { path: “$aplm_docs”, preserveNullAndEmptyArrays: true }},

{ $group: {

_id: “$aplm_docs.name”,

Airport_Name: { $min: “$airportname” } ,

Landmark_Time: { $min: “$tz”}

}

{ $sort : { Airport_Name: 1 } },

{ $project: { _id: 1, Airport_Name:1, Landmark_Time:1 } }

]);

Observations:

This query uses INNER JOIN which MongoDB does not have. So, in MongoDB, you first do the lookup join to get the LEFT OUTER JOIN, and then eliminate non matching, but projected documents (because of the left outer) using the match stage (code: $match: {“airline_docs”: {$ne: []}}).
Then, you’ve to remember the matched documents are in an array data structure, unwind them before you group them by the landmark.name. Then do the sort and final projection.

As expected, the MongoDB join query is procedural and you’ve to understand the execution plan and write code for each stage.

Example 3: Starting from San Francisco, find all the destination airports (those have routes from SFO).

Couchbase N1QL


SELECT DISTINCT route.destinationairport
FROM `travel-sample` airport JOIN `travel-sample` route
     ON (airport.faa = route.sourceairport AND route.type = “route”)
WHERE airport.type = “airport”
  AND airport.city = “San Francisco”
  AND airport.country = “United States”
ORDER BY route.destinationairport

SELECT DISTINCT route.destinationairport

FROM `travel–sample` airport JOIN `travel–sample` route

ON (airport.faa = route.sourceairport AND route.type = “route”)

WHERE airport.type = “airport”

AND airport.city = “San Francisco”

AND airport.country = “United States”

ORDER BY route.destinationairport

MongoDB:


db.airport.aggregate([
   {
      $match: {
          $and: [
            {“type”: “airport”},
            { city: “San Francisco”},
            { “country”: “United States”}
          ]
      }
   },
   {
     $lookup:
       {
         from:”route”,
         let: { rfaa : “$faa”},
         pipeline: [
           { $match:
               { $expr:
                 { $and:
                   [
                     { $eq: [“$sourceairport”, “$$rfaa”]} ,
                     { $eq: [“$type”, “route”] }
                   ]

                }
              }
           }
         ],
         as: “airline_docs”
       }
   },
   { $match: {“airline_docs”: {$ne: []}} },
   {  $unwind: { path: “$airline_docs”, preserveNullAndEmptyArrays: true }},
   { $project: { _id:0,  “airline_docs.destinationairport” : 1 }},
   { $group: {
      _id : “$airline_docs.destinationairport”
     }
   },
   { $sort: { _id : 1 }},
 ]);

db.airport.aggregate([

{

$match: {

$and: [

{“type”: “airport”},

{ city: “San Francisco”},

{ “country”: “United States”}

]

}

{

$lookup:

{

from:“route”,

let: { rfaa : “$faa”},

pipeline: [

{ $match:

{ $expr:

{ $and:

[

{ $eq: [“$sourceairport”, “$$rfaa”]} ,

{ $eq: [“$type”, “route”] }

]

}

as: “airline_docs”

}

{ $match: {“airline_docs”: {$ne: []}} },

{ $unwind: { path: “$airline_docs”, preserveNullAndEmptyArrays: true }},

{ $project: { _id:0, “airline_docs.destinationairport” : 1 }},

{ $group: {

_id : “$airline_docs.destinationairport”

}

{ $sort: { _id : 1 }},

]);

Observations:

The join clause for this query is a bit more complex, with two predicates (airport.faa = (route.sourceairport AND route.type = “route”). This requires a cumbersome pipeline syntax on the MongoDB query.
And because you need to differentiate between the two collections, you need another let stage to create the local variables for airport attributes.
Like before, it requires an additional match clause to eliminate non-matching (empty) airline docs, followed by grouping and sorting.
As you can see visually, the MongoDB query is getting larger and larger to do the same job as Couchbase N1QL.

Example 4: Find all the hotels and landmarks in Yosemite. Hotels should have altleast 5 likes.

Couchbase N1QL


SELECT hotel.name hotel_name, landmark.name landmark_name, landmark.activity
FROM `travel-sample` hotel INNER JOIN `travel-sample` landmark
    ON (hotel.city = landmark.city 
        AND hotel.country = landmark.country 
        AND landmark.type = “landmark”)
WHERE hotel.type = “hotel” 
  AND hotel.title like “Yosemite%” 
  AND array_length(hotel.public_likes) > 5;

SELECT hotel.name hotel_name, landmark.name landmark_name, landmark.activity

FROM `travel–sample` hotel INNER JOIN `travel–sample` landmark

ON (hotel.city = landmark.city

AND hotel.country = landmark.country

AND landmark.type = “landmark”)

WHERE hotel.type = “hotel”

AND hotel.title like “Yosemite%”

AND array_length(hotel.public_likes) > 5;

MongoDB:


db.hotel.aggregate([
   { $match:  { title: { $regex: /^Yosemite/ } }, },
   {
     $lookup:
       {
         from:”landmark”,
         let: { hcity : “$city”, hcountry : “$country”},
         pipeline: [
           { $match:
               { $expr:
                 { $and:
                   [
                     { $eq: [“$city”, “$$hcity”]} ,
                     { $eq: [“$country”, “$$hcountry”] }
                   ]
                }
              }
           }
         ],
         as: “hotel_lm_docs”
       }
   },
   { $match : {“hotel_lm_docs”: { $ne: [] }}},
   { $project: {_id:0, hname: “$name”, public_likes: 1, hotel_lm_docs:1}},
   { $unwind: { path: “$hotel_lm_docs”, preserveNullAndEmptyArrays: true }},
   { $project: { _id: 1, hname : 1 , “hotel_lm_docs.name” : 1, “hotel_lm_docs.name” : 1, “hotel_lm_docs.activity” : 1, mt5 : {$gt: [ {$size: “$public_likes”}, 5]}}},
   { $match: { mt5 : true } },
   { $project: {_$id:0}}
 ]);

db.hotel.aggregate([

{ $match: { title: { $regex: /^Yosemite/ } }, },

{

$lookup:

{

from:“landmark”,

let: { hcity : “$city”, hcountry : “$country”},

pipeline: [

{ $match:

{ $expr:

{ $and:

[

{ $eq: [“$city”, “$$hcity”]} ,

{ $eq: [“$country”, “$$hcountry”] }

]

}

as: “hotel_lm_docs”

}

{ $match : {“hotel_lm_docs”: { $ne: [] }}},

{ $project: {_id:0, hname: “$name”, public_likes: 1, hotel_lm_docs:1}},

{ $unwind: { path: “$hotel_lm_docs”, preserveNullAndEmptyArrays: true }},

{ $project: { _id: 1, hname : 1 , “hotel_lm_docs.name” : 1, “hotel_lm_docs.name” : 1, “hotel_lm_docs.activity” : 1, mt5 : {$gt: [ {$size: “$public_likes”}, 5]}}},

{ $match: { mt5 : true } },

{ $project: {_$id:0}}

]);

Observation:

Translating the LIKE predicate into a regular expression was straightforward, but determining if there were atleast five public_likes was not. Needed additional projection and matching phase to calculate the size of the public_likes at the end.
When you have many attributes to match, manipulate and project, you’d have to rename them properly at appropriate stage otherwise, the query can’t reference it. For example, hotel.name had to be renamed to hname before the unwind. Maybe there’s a better way to write this stage!
N1QL expressed the query in 370 characters. MongoDB required 956 characters. All this for a two table join. As the complexity increases, the ratio increases as well since the MongoDB query is written in a procedural way.

Example 5: Find all the hotels and landmarks in Yosemite. Hotels should have at least 5 likes.

This is just like Example 4, but do it faster!

Couchbase N1QL


SELECT hotel.name hotel_name, landmark.name landmark_name, landmark.activity
FROM `travel-sample` hotel INNER JOIN `travel-sample` landmark USE HASH(build)
    ON (hotel.city = landmark.city 
        AND hotel.country = landmark.country 
        AND landmark.type = “landmark”)
WHERE hotel.type = “hotel”
  AND hotel.title like “Yosemite%” 
  AND array_length(hotel.public_likes) > 5;

SELECT hotel.name hotel_name, landmark.name landmark_name, landmark.activity

FROM `travel–sample` hotel INNER JOIN `travel–sample` landmark USE HASH(build)

ON (hotel.city = landmark.city

AND hotel.country = landmark.country

AND landmark.type = “landmark”)

WHERE hotel.type = “hotel”

AND hotel.title like “Yosemite%”

AND array_length(hotel.public_likes) > 5;

Observation:

The default join method in Couchbase N1QL is nested loop join. This works fine when you have a smaller number of documents involved on each side of the join. When you have a larger data set, typically in reporting queries, nested loop join slows down. Couchbase N1QL has hash joins and this speeds up joins significantly. When each side of join has thousands of documents to millions of documents, the speed increase can be 2x to 20x or more. See the detailed Couchbase blog on ANSI Joins for more information.

From the documentation and explain plan, it’s unclear what join method MongoDB uses. Some of the blogs indicate that they’ve used a nested loop join to implement the $lookup operator.

Summary:

	Couchbase N1QL	MongoDB
JOIN approach	Declarative, like SQL. Allows joining between any sized and distributed data set.	Procedural with some declarative aspects (e.g. index selection). Can only join a sharded collection in an unsharded collection. To join two sharded collections, applications will have to write the join algorithm.
JOINs supported	LEFT OUTER JOIN INNER JOIN RIGHT OUTER JOIN	$lookup implements the LEFT OUTER JOIN on scalar values.
ON-clause support	Full expressions. Scalars Arrays	Implicit equality Pipeline expression Arrays should to be $unwind before the $lookup
JOIN implementation	Block Nested Loop Hash join with user defined build and probes.	Nested Loop
ON Clause	ON clause with any expression.	$pipeline expression
Array expressions in ON clause	Use ANY, IN expressions. Supports UNNEST	Pipeline with $unwind before $match
Explain	Visual explain and JSON explain	Visual explain and JSON explain
JOIN order	Left to right, as specified by the user. The optimizer is rule based.	As specified in the pipeline.
Nested JOINs	Supported via derived tables. FROM clause can have subselects which can have joins or subselects in turn.	No
JOIN predicate processing	Optimizer processes the join predicates, constant predicates and pushes the predicates to the index automatically.	Manual design of predicates for each collection, careful ordering of pipeline stages without full help from the optimizer.

How about performance? Good question. That’s for a future blog!

And now, a quote:

“A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts.”

— William Strunk, Jr. Elements of Style.

References:

Couchbase Documentation: https://docs.couchbase.com
MongoDB Documentation: https://docs.mongodb.com/
ANSI Joins in Couchbase N1QL: https://www.couchbase.com/blog/ansi-join-support-n1ql/
N1QL Tutorial: https://query-tutorial.couchbase.com/tutorial/#1

Platform

Services

Self-Managed

Capabilities

By Use Case

By Industry

Popular Docs

Quickstart

Resource Center

About

Partnerships

How to Join JSON: Couchbase N1QL vs. MongoDB Query

Your AI Agents Are Stuck in Pilot. It’s a Data Problem, Not a Model Problem.

When the Internet Goes Down, Your Business Shouldn’t

Distributed Databases: An Overview

On-Device AI: Benefits, Use Cases, and Challenges

Accelerating AI in Healthcare: Fix Data Infrastructure Before AI Fails Become a Board Priority

Ready to get Started with Couchbase Capella?

Start building

Use Capella free

Get in touch

Platform

Services

Self-Managed

Capabilities

By Use Case

By Industry

Popular Docs

Quickstart

Resource Center

About

Partnerships

How to Join JSON: Couchbase N1QL vs. MongoDB Query

Get Couchbase blog updates in your inbox

Author

Posted by Keshav Murthy

Leave a comment Cancel reply

Ready to get Started with Couchbase Capella?

Start building

Use Capella free

Get in touch