Analytics Explain Plan - Part 1 - The Couchbase Blog

Co-author: Till Westmann, Senior Director, Engineering

Introduction

Couchbase Analytics is the “newest kid on the block” amongst all the services available in Couchbase Server. The new service is providing rapid time to insight in many use cases like ship to shore, shopping cart analysis, store placement of goods, airline ticket bookings, hotel inventory and many more. Couchbase Server delivers unmatched performance in some of the most demanding use cases, and the expectations from the Analytics service are no different. Over the last few months, we’ve optimized N1QL (SQL for JSON) queries for customers who’ve deployed Analytics in a variety of different deployment topologies. “How can I interpret the Analytics query plan?” is a recurring request and I will demystify the Analytics query plan in a blog post series.

In this blog post, part 1, I will provide background on the execution framework and then explain the plan for a simple query. In the next blog of this series I will explain the plan for more complex queries – including joins, index accesses, and aggregations.

Background

Couchbase Analytics adds parallel data management to Couchbase Server to complement the capabilities offered by the Query and Index services. Couchbase Analytics has a full MPP (massively parallel processing) based query processor that splits the work of processing a single query across all of the Analytics nodes in a Couchbase data platform cluster. As a result, you can run complex analytical queries – ad-hoc joins, set aggregation and grouping operations- quickly and in a scalable manner.

The picture below is a simplified conceptual representation of how a query is processed by the Analytics engine. The picture represents what goes on in a single node and hides the details of the MPP aspects of the query engine.

A request is submitted to the Analytics Service and after it is authenticated, the query string is passed to the query compiler.
The query compiler parses and translates the query and produces an optimized query plan. A query plan is a tree (or a DAG) of operators and connectors. These operators are similar to the relational algebra operators in relational databases. The key difference here being that the operators in Couchbase Analytics are also capable of handling nested and schemaless JSON data. The details are outside the scope of this blog post, but you can get under the hood in this video presentation from Prof Mike Carey, Chief Architect of Couchbase Analytics.
The query plan is handed off to the execution engine which evaluates the operators in the query plan. Analytics is designed for big data and can gracefully handle data spilling from memory to disk as needed.
The buffer cache is used by the execution engine to read stored data from disk and cache the data in memory for quicker access as needed.
The results of the execution are passed to the response handler.
The response handler then passes the results back to the client.

Query Plan

A query plan describes the path of a document through the execution engine. The operations in the plan will be executed for each qualifying document.

Let me illustrate this with a very simple query below which selects the name is all breweries in California.

select name 
from breweries 
where state=”California”

select name

from breweries

where state=”California”

The best way to read the plan would be bottoms-up. It would give you a view of how data is accessed and the steps that follow for query execution.

Query Plan Operators Explanation

Distribute-result: This is the root of the plan that receives the query result. When a client requests the results (synchronously or asynchronously), the results are retrieved from each node and sent to the client.

This is a parallel operation that is partitioned across all nodes in the cluster.

Project: Projects out the field $$breweries and keeps $$15 (which contains the name of the brewery).

This is a parallel operation that is partitioned across all nodes in the cluster.

Assign: This operator evaluates one or more expressions and assigns the results to new variables. In this case the value of the key “name” is assigned to variable $15.

This is a parallel operation that is partitioned across all nodes in the cluster.

Select: Selects the matching records based on the predicate defined in the query. In this case the brewery records that belong to “California” are selected and the rest are filtered out.
This is a parallel operation that is partitioned across all nodes in the cluster.

Project: Projects out all $$16 and $$17 and keeps $$breweries.

This is a parallel operation that is partitioned across all nodes in the cluster.

Data-scan: Reads each record from the dataset “breweries”. “vars” is the list of variables that are created as a result of this operation. The first key for each record is assigned to $$16, the whole record is assigned to $$breweries, and some record metadata is assigned to $$17.
This is a parallel operation that is partitioned across all nodes in the cluster.

The explain plan is a great tool to optimize analytic queries running in Couchbase Server and this introduction should help demystify the plan and interpret it. You can try it yourself by downloading the latest version Couchbase Server 6.5 and engage with us on forums to get your questions answered.

Sachin Smotra, Director Product Management, Couchbase

Platform

Self-Managed

Services

Capabilities

Why Couchbase?

Migrate to Capella

By Use Case

By Industry

By Application Need

Popular Docs

By Developer Role

Quickstart

Resource Center

About

Partnerships

Our Services

Partners: Register a Deal

Ready to register a deal with Couchbase?

Marriott

All Posts

Analytics Explain Plan – Part 1

Introduction

Background

Query Plan

Author

Posted by Sachin Smotra, Director Product Management, Couchbase

Leave a reply Cancel reply