Co-author: Till Westmann, Senior Director, Engineering
The bits for Couchbase 6.5, the upcoming release of Couchbase Server, are now available. The Analytics service, which supports efficient parallel query processing, has added support for support for window functions defined in the sql:2003 standard.
For many of us (yours truly included) window functions is this magical feature that we rarely use. But window functions make SQL queries more concise thereby improving readability which makes it easy to maintain queries for real world applications. The queries expressed by window functions would otherwise involve expensive self-joins and subqueries which are hard to optimize resulting in slow running queries. I’ll illustrate the point by calculating a running total which is the most practical example for using window functions.There are multiple ways of calculating a running total in SQL. In this blog post, I will calculate the running total in two ways – the old school way using an INNER JOIN and the newly available Window Functions in Couchbase 6.5.
What is a Running Total?
A Running Total is a total that is continually adjusted to take account of items as they are added. Another term for it is partial sum. In the example below, the running total for a transaction date would be the InvoiceID’s running total plus the current TransactionAmount.
While I have the RunningTotal calculated above, please note the first three fields are the source data and the RunningTotal needs to be calculated using a SQL query.
Download Couchbase Server 6.5 and setup a Couchbase bucket for the sample data to run the queries in this blog post.
Create a Couchbase bucket called “invoices” with the data below.
On the Analytics workbench, run the following statement to create the dataset
CREATE DATASET invoices on invoices;
CONNECT LINK Local;
Inner Join query for Running total on a given day
For each InvoiceID, we need to retrieve the transaction amount and all transaction amounts before it. This is accomplished using an Inner Join with a condition to get each invoice and those prior to it.
SELECT inv1.InvoiceID , inv1.TransactionDate, inv1.TransactionAmount, SUM(inv2.TransactionAmount) AS RunningTotal
FROM invoices inv1 JOIN invoices inv2 ON inv1.InvoiceID >= inv2.InvoiceID
WHERE inv1.TransactionDate = "2019-08-11"
GROUP BY inv1.InvoiceID, inv1.TransactionDate, inv1.TransactionAmount
ORDER BY inv1.InvoiceID
JOINS and NoSQL haven’t always played well together. In fact, the conventional wisdom is that data in the NoSQL world is denormalized to avoid JOINs. But Couchbase Analytics was created for data exploration and ad-hoc analysis that includes complex joins and aggregations. The Analytics query engine can process JOINs that are not supported by other NoSQL vendors –
In case you are wondering how Couchbase Analytics can join JSON data, I recommend getting under the hood in this video presentation with Prof Mike Carey, Consulting Chief Architect for Couchbase.
Window function query for Running total on a given day
If you are not familiar with window functions, I would recommend reading PostgresSQL’s documentation which provides a great introduction:
“A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.”
Here is what the actual query looks like –
SELECT InvoiceID, TransactionDate, TransactionAmount, SUM(TransactionAmount) OVER (ORDER BY InvoiceID) as RunningTotal
WHERE TransactionDate = "2019-08-11"
ORDER BY InvoiceID
In addition to the simplicity and conciseness, there are a couple of interesting things about the above statement –
- There is an aggregate function but no group by clause. Typically aggregate functions require a group by clause but since we are using the OVER clause, SUM is considered a window function.
- The OVER clause retrieves a specific set of rows, relative to the current row, and perform an operation over a specific field.
- ORDER BY defines the logical order in which the window function is evaluated.
Note: Couchbase supports window functions for both the Query and Analytics services and the same query can be run in both the query and the analytics workbenches. To run on the query service, the following indexes need to be created
CREATE PRIMARY INDEX ON invoices;
CREATE INDEX invoices_id ON invoices(InvoiceID);
The purpose of this blog post was to introduce window functions in Couchbase Analytics. Window functions are NOT new to SQL, they are a part of the SQL standard and are available in all major relational databases and Big Data tools like Hive and Spark. With the 6.5 release, Couchbase is bringing these to document databases. If you’d like to experience this first hand – please download the bits and engage with us on forums.