SQL has been around for a loooooong time. It is very intuitive and efficient way to process structured data and has been the choice for databases for many years. With the world of BIG DATA however, data has velocity, variety and volume. SQL can tackle the 2 “v”s fine with optimizations: “velocity” and “volume”. In fact, many new dielects of SQL (N1QL, Spark, U-SQL, Impala, Drill and more) are doing exactly that. However, “variety” is a different ballgame! Big Data is complex in the new world; it has unpredictable, constantly evolving, jagged scheme, sparse values and deeply nested structures. For that SQL needs to be extended!
Well, I am super excited to report that we have built N1QL in Couchbase Server 4 to tackle exactly these issues! N1QL can easily traverse over the complex structures in JSON (the de-facto serialization standard for complex data). Using N1QL, you can work with not only NULLs but attributes that are MISSING in various shapes of JSON you process. OR you can use operators like ANY/EVERY to query over arrays embedded within JSON document. OR you can use commands like UNNEST & NEST that can flatten or unflatten nested arrays. There are too many of these powerful extensions to count here so I won’t do that. Instead I am going to show you one hidden gem that was originally posted here by Gerald. This gem is very useful with aggregates if you are using N1QL. vs a relational database like SQL Server.
One of the big advantages of N1QL is its ability to understand the array types. Aggregation functions like MAX() are not revolutionary but with the additions of nesting and arrays, something as simple as MAX() can be super powerful. One note before I dive in – I am going to pick on TSQL and SQL Server as I contributed to TSQL in past life at Microsoft. However this is as applicable to Oracle, Postgres, Mysql, Informix or DB2 and more… Ok so: Imagine trying to find attributes of the product that has the MAX price. In SQL Server that is fairly simple query to write using TSQL;
1 2 3 |
SELECT productID, name FROM t1 WHERE price = (SELECT MAX(price) FROM t1) GO |
That is good. Here is the output – the product with the highest price is product named “c” with ID 3.
1 2 3 4 |
productID name ----------- ----------------------------------------------------------------- 3 c |
Here is the execution plan looks like. Basically The execution plan scans for the MAX price value. Once you have the value, It is a nested loop join to search for the other attributes of the product in the table like productID and name.
1 2 3 4 5 6 7 8 |
Stmt Text --------------------------------------------------------------------------------------------- select productID, name from t1 where price = (select max(price) from t1) |--Nested Loops(Inner Join, WHERE:([Expr1004]=[test].[dbo].[t1].[price])) |--Stream Aggregate(DEFINE:([Expr1004]=MAX([test].[dbo].[t1].[price]))) | |--Clustered Index Scan(OBJECT:([test].[dbo].[t1].[PK__t1__2D10D14A7FD17868])) |--Clustered Index Scan(OBJECT:([test].[dbo].[t1].[PK__t1__2D10D14A7FD17868])) |
However, N1QL has a great advantage here. Since it can process MAX() with arrays, you can return all the attributes of the document without an additional scan.
1 |
SELECT MAX([price, {productID, name]) FROM test; |
OR you can simply return the full document using the following query;
1 |
SELECT MAX([price, test]) FROM test; |
Lets take a look at the execution plan for the N1QL query. Here comes the spoiler: you see a single FETCH operation that can perform the MAX and the projection is done without a second FETCH.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |
cbq>explain select MAX([price,test])from test; { "requestID":"b735ce5f-700c-4740-a065-6d4ba681129f", "signature":"json", "results":[ { "#operator":"Sequence", "~children":[ { "#operator":"PrimaryScan", "index":"#primary", "keyspace":"test", "namespace":"default", "using":"gsi" }, { "#operator":"Parallel", "~child":{ "#operator":"Sequence", "~children":[ { "#operator":"Fetch", "keyspace":"test", "namespace":"default" }, { "#operator":"InitialGroup", "aggregates":[ "max([(`test`.`price`), `test`])" ], "group_keys":[] } ] } }, { "#operator":"IntermediateGroup", "aggregates":[ "max([(`test`.`price`), `test`])" ], "group_keys":[] }, { "#operator":"FinalGroup", "aggregates":[ "max([(`test`.`price`), `test`])" ], "group_keys":[] }, { "#operator":"Parallel", "~child":{ "#operator":"Sequence", "~children":[ { "#operator":"InitialProject", "result_terms":[ { "expr":"max([(`test`.`price`), `test`])" } ] }, { "#operator":"FinalProject" } ] } } ] } ], "status":"success", "metrics":{ "elapsedTime":"1.8224ms", "executionTime":"1.7614ms", "resultCount":1, "resultSize":2347 } } |
Even though, this is a pretty cool trick and a great performance gain, We just scratched the surface of what N1QL is capable of. There is a ton more to discover with N1QL. You can get started with Couchbase Server and N1QL here with the getting started guide.
———————————————————
By the way if you’d like to try this out with SQL Server 2016 and Couchbase 4, here are the scripts to see this yourself in action;
SQL Server Script
Note: Run this in a database called “test”
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
create table t1(productID int primary key, price int, name varchar(128)); go insert into t1(productID,price,name) values(1,10,'a'); insert into t1(productID,price,name) values(2,9,'b'); insert into t1(productID,price,name) values(3,12,'c'); insert into t1(productID,price,name) values(4,11,'d'); insert into t1(productID,price,name) values(5,1,'e'); go set statistics profile on go select max(price),productID,name from t1 go --Msg 8120, Level 16, State 1, Line 10 --Column 't1.productID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. select productID, name from t1 where price = (select max(price) from t1) go |
Here is the Couchbase Server Script
Note: create a bucket called “test”
insert into test(key,value) values(“1”,{productID:1, price:10, name:’a’});
1 2 3 4 5 6 7 8 9 |
insert into test(key,value) values("1",{productID:1, price:10, name:'a'}); insert into test(key,value) values("2",{"productID":2, "price":9, "name":"b"}); insert into test(key,value) values("3",{"productID":3, "price":12, "name":"c"}); insert into test(key,value) values("4",{"productID":4, "price":11, "name":"d"}); insert into test(key,value) values("5",{"productID":5, "price":1, "name":"e"}); create primary index on test; select max(price, productID, name) from t1; select max([price, test]) from test; explain select max([price, test]) from test; |