Array is THE difference between the relational model and the JSON model.   — Gerald Sangudi

Abstract

JSON array gives you flexibility in the type of elements,  number of elements, size of the elements, and the depth of the elements.  This adds to the flexibility of operational JSON databases like Couchbase and MongoDB. The performance of queries with an array predicate in operational databases depends on the array indexes. However, the array indexes in these databases come with significant limitations. E.g. only one array key is allowed per index. The array indexes, even when created, can only process AND predicates efficiently.  The upcoming Couchbase 6.6 release removes these JSON limitations by using a built-in inverted index to be used to index and query arrays in N1QL.  This article explains the background and the workings of this novel implementation.

Introduction

An array is a basic type built into JSON defined as An array is an ordered collection of values. An array begins with [left bracket and ends with ]right bracket. Values are separated by ,commaAn array gives you flexibility because it can contain an arbitrary number of scalar, vector, and object values. A user profile can have an array of hobbies, a customer profile an array of cars, a member profile an array of friends. Couchbase N1QL provides a rich set of operators to manipulate arrays; MongoDB has a list of operators to handle arrays as well.

Before you start querying, you need to model your data in arrays. All JSON document databases like Couchbase, MongoDB recommend you to denormalize your data model to improve your performance and appdev. What that means is, transform your 1:N relationship into a single document by embedded the N into 1.  In JSON, you’d do that by using an array.  The example below, the document(1) contains 8 (N) likes. Instead of storing a foreign key reference to another table, in JSON, we store data inline. 

Values here are arrays of strings. In JSON, each element can be any valid JSON type: scalars (numeric, string, etc), or objects or vectors (arrays).  Each hotel document contains an array of reviews.  This is the process of denormalization.  Converting multiple 1:N relationships to a single hotel object that contains N public_likes and M reviews.  With this, the hotel object embeds two arrays: public_likes and reviews.  There can be any number of values of any type under these arrays.  This is the key contributing factor to JSON Schema flexibility.  When you need to add new likes or reviews, you simply add a new value or an object to this. 

Like the hotel object above, you denormalize your data model into JSON, there could be many arrays for each object. Profiles have arrays for hobbies, cars, credit cards, preferences, etc.  each of these can be scalars (simple numeric/string/boolean values) or vectors (arrays of other scalars, arrays of objects, etc). 

 

Once you model and store the data, you need to process it — select, join, project. Couchbase N1QL (SQL for JSON) provides an expressive language to do these and more.   Here are common use cases.

Array Indexing:

Indexing arrays are a challenge for B-tree based indexes.  However, the JSON database has to do it to meet the performance requirements: MongoDB does it; Couchbase does it. However, both come with limitations.  You can only have one array key within an index.  This is true of MongoDB; this is true of Couchbase N1QL.  The core reason for this limitation is when you index elements of an array, you need separate index entries.  

The size of the index grows exponentially as the number of array keys in the index and the number of array elements in the index.    Hence the limitation.   Implications of this limitation are: 

  • Push only one array predicate to the index scan and handle other predicates after the index scan.
    • This means queries with multiple array predicates could be slow
  • Avoid composite indexes with array keys to avoid huge indexes.
    • This means queries with complex predicates on array keys will be slow

Good news from the LEFT FIELD.

The full-text search index was designed to handle text pattern search based on relevance.   The way it does by tokenizing each field.  In this example below, each document is analyzed to get tokens:

For each token, it keeps the list of documents that token is present.  This is the inverted tree structure! Unlike a B-Tree based index, it avoids repeating the same token value N number of times, one for each document it’s present in.  When you have millions or billions of documents, this is a huge saving. 

The second thing to note here is, the inverted index used for an ARRAY OF TOKENS!    In fact, the inverted tree structure in the full-text-search is ideal for indexing and searching for array values, especially when these values have duplicates. 

Indexing arrays using the inverted index will be the same process, except there’s no tokenization. Let’s revisit indexing our document “bob”, with additional documents, “Sam” and “Neill”

The Couchbase FTS has an analyzer called a keyword analyzer. This indexes the values as-is instead of trying to find its root by stemming it. Basically, value is the token.  For array value indexing, we can use this index and exploit the efficiencies of an inverted index.  Let’s construct an FTS index on the documents bob, sam, neil.  In the case of the inverted tree, each field gets its own inverted tree: one each for id, a, and b.  Because these are individual trees, they don’t grow exponentially as the B-tree composite index. The number of index entries is proportional to the number of unique items in each field.  In this case, we have 14 entries for the 3 documents with three fields of a total of 24 values.  Creating a B-tree index on (id, a, b) for the same document will create 36 entries!

Notice that for three documents with two index entries the difference is 157%. As the number of documents, the number of arrays increases, the savings using an inverted index increase as well.

Inverted index on three fields.

Inverted index on three fields.

However, we do have an issue.  How you process predicates?

The B-Tree index stores all of the values in (id, a, and b) together, the inverted index in FTS has distinct trees for each field. So, applying multiple predicates is not so easy.  This is true for our array processing as well as text processing.  It’s common in text processing to ask questions like: search for all Californian residents with skiing as their hobby.

To process this, FTS applies the predicate on each field individually to get the list of document-keys for each predicate.  It then applies the boolean predicate AND on top of it. This layer uses the famed roaring bitmap package to create and process the bitmap of document ids to finalize the result.   Yes, there is extra processing here compared to a simpler B-TREE based index, but this makes it possible to index many arrays and process the query in a reasonable time.

Inverted Tree:  A tree that keeps on giving!

The B-Tree composite index combines the scanning and the AND predicate application.   The inverted tree approach separates the two.  Index and scanning each field is different from processing the composite predicate.  Because of this separation,  the bitmap layer can process OR, NOT predicates along with AND predicates.  Changing the AND in the previous example to OR is simply an instruction to the bitmap processing on the document qualification and deduplication

COUCHBASE Release:

Couchbase 6.6 will support using FTS indexes for processing complex array predicates.  This improves the TCO of array handling, enables developers and designers to use, index, query arrays as needed without limitations. Look for upcoming announcements, documentation, feature blogs, etc.

References

  1. Working with JSON Arrays in N1QL
  2. Utilizing Arrays: Modeling, Querying, and Indexing
  3. Couchbase N1QL Collection Operators
  4. MongoDB: Query an arrray
  5. Couchbase FTS
  6. FREE: Couchbase interactive training
  7. FTS BLogs:  https://www.couchbase.com/blog/tag/fts/
  8. Collection operators
  9. ARRAY indexing
  10. Making most of your arrays…with N1QL Array Indexing
  11. Making most of your Arrays.. with Covering Array Indexes and more.
  12. Couchbase Indexing
  13. NEST and UNNEST: Normalizing and Denormalizing JSON on the Fly

Author

Posted by Keshav Murthy

Keshav Murthy is a Vice President at Couchbase R&D. Previously, he was at MapR, IBM, Informix, Sybase, with more than 20 years of experience in database design & development. He lead the SQL and NoSQL R&D team at IBM Informix. He has received two President's Club awards at Couchbase, two Outstanding Technical Achievement Awards at IBM. Keshav has a bachelor's degree in Computer Science and Engineering from the University of Mysore, India, holds ten US patents and has three US patents pending.

Leave a reply