Co-author: Sitaram Vemulapalli, Principal Engineer, Couchbase R&D.
“The answer my friend is hiding in JSON” – Bob Dylan
There are many public JSON datasets, and then there are awesome JSON datasets. Every company, including yours, has stored a lot of data in JSON: the results of surveys, campaigns, and forums.
There are many ways to skin the JSON. You can write a Python program for every report or visualization you want. Or, you can use N1QL (SQL for JSON) and let it generate the right algorithm to analyze the JSON data for you. In this article, we show you how to use N1QL to extract insights quickly. We also use two features coming in the next release: Common Table Expressions (CTE) and Window Functions.
Goal: Use a public JSON dataset of US Open golf scores to create a simple leaderboard, ranking, etc.
Three things you’ll do as part of this:
- Ingest the data into Couchbase easily.
- Start getting the value of this JSON data immediately.
- Shape the JSON to generate useful reports using new features quickly.
Source Data: https://github.com/jackschultz/usopen
Queries in this post are also available at: https://github.com/keshavmr/usopen-golf-queries
Data repo structure: The GitHub repo https://github.com/jackschultz/usopen contains the 2018 US Open golf data. For each hole, it has a separate document for each day.
Each document has this structure. This is the document for hole 1 on day 1. The field Ps has the list of players, each with a unique ID.
Each player's playing statistics follow, stroke by stroke. The players are matched to their scores using the player's unique ID field.
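As a rough sketch of the document shape the queries in this post rely on: the field names below (Ps, Rs, Hs, HPs, Sks, ID, FN, LN, Nat, Par) are taken from the queries, while the values, the second player, and the empty stroke objects are made up for illustration. Real documents carry more fields. This is plain Python, not a Couchbase client call.

```python
# Hypothetical, trimmed-down hole document (key "holes:1:1").
# Field names follow the queries in this post; values are illustrative only.
hole_doc = {
    "Ps": [  # players
        {"ID": "37189", "FN": "Harold", "LN": "Varner", "Nat": "USA"},
        {"ID": "00001", "FN": "Example", "LN": "Player", "Nat": "USA"},
    ],
    "Rs": [  # results
        {"Hs": [  # holes
            {"Par": 4,
             "HPs": [  # one entry per player on this hole
                 {"ID": "37189", "Sks": [{}, {}, {}, {}]},      # one item per stroke
                 {"ID": "00001", "Sks": [{}, {}, {}, {}, {}]},
             ]},
        ]},
    ],
}

# Match players to their strokes through the shared ID field.
names = {p["ID"]: p["FN"] + " " + p["LN"] for p in hole_doc["Ps"]}
strokes = {
    names[hp["ID"]]: len(hp["Sks"])
    for r in hole_doc["Rs"] for h in r["Hs"] for hp in h["HPs"]
}
print(strokes)
```

The number of items in Sks is what the queries later treat as the player's score for the hole.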
Start getting insights:
Before you start querying, create a primary index on the bucket.
CREATE PRIMARY INDEX ON usopen;
Task 1: Create a report of player scores by round and the final total.
After exploring the JSON from the bottom up, we came up with this query. The explanation follows the query.
WITH d AS (
  SELECT pl.hnum AS holedn,
         pl.ps.Nat AS country,
         (pl.ps.FN || " " || pl.ps.LN) AS name,
         pl.ps.ID AS ID,
         ARRAY_LENGTH(hps.Sks) AS score,
         hpl.hole AS `hole`,
         hpl.day AS `day`
  FROM (
    SELECT meta(usopen).id AS hnum, ps
    FROM usopen USE KEYS "holes:1:1"
    UNNEST Ps AS ps
  ) pl
  INNER JOIN (
    SELECT TONUMBER(SPLIT(meta(usopen).id, ":")[1]) AS `hole`,
           TONUMBER(SPLIT(meta(usopen).id, ":")[2]) AS `day`,
           hps
    FROM usopen
    UNNEST Rs AS rs
    UNNEST rs.Hs AS hs
    UNNEST hs.HPs AS hps
  ) hpl ON (pl.ps.ID = hps.ID)
)
SELECT d.name,
       SUM(CASE WHEN d.day = 1 THEN d.score ELSE 0 END) R1,
       SUM(CASE WHEN d.day = 2 THEN d.score ELSE 0 END) R2,
       SUM(CASE WHEN d.day = 3 THEN d.score ELSE 0 END) R3,
       SUM(CASE WHEN d.day = 4 THEN d.score ELSE 0 END) R4,
       SUM(d.score) T
FROM d
GROUP BY d.name
ORDER BY d.name
Tabular Results (In Tabular form, from the Couchbase query workbench)
Let’s look at the query block by block.
Look at the WITH d clause. The statement untangles the JSON from per-day, per-hole, shot-by-shot data into simple scalar values.
{
  "d": {
    "ID": "37189",
    "country": "USA",
    "day": 1,
    "hole": 10,
    "holedn": "holes:1:1",
    "name": "Harold Varner",
    "score": 6
  }
}
Holedn is the document key – hole-day-number
Country is the player’s nationality
ID is the player’s unique ID.
Hole and day are obvious and score is the player’s score for that hole.
In the FROM clause of the SELECT statement, pl is the full list of players taken from the document for the first day, first hole (holes:1:1).
Rs is the players' results, shot by shot, hole by hole. First, we unnest that array a couple of times to project the details of each hole and the score for that hole, which is determined by array_length(hps.Sks).
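To make the unnesting concrete, here is a minimal plain-Python sketch of what the inner subquery computes. The function name flatten and the sample document are hypothetical; the key layout "holes:<hole>:<day>" is assumed from the SPLIT(meta(usopen).id, ":") indexes used in the query.

```python
def flatten(doc_key, doc):
    """Mimic the three UNNESTs: one row per player entry in a hole document."""
    # Key layout "holes:<hole>:<day>", matching split(...)[1] and split(...)[2].
    _, hole, day = doc_key.split(":")
    rows = []
    for rs in doc.get("Rs", []):              # UNNEST Rs AS rs
        for hs in rs.get("Hs", []):           # UNNEST rs.Hs AS hs
            for hps in hs.get("HPs", []):     # UNNEST hs.HPs AS hps
                rows.append({
                    "ID": hps["ID"],
                    "hole": int(hole),        # TONUMBER(...)
                    "day": int(day),
                    "score": len(hps["Sks"]), # array_length(hps.Sks)
                })
    return rows


# Illustrative document: one player who took six strokes on hole 10, day 1.
doc = {"Rs": [{"Hs": [{"HPs": [{"ID": "37189", "Sks": [1, 2, 3, 4, 5, 6]}]}]}]}
rows = flatten("holes:10:1", doc)
print(rows)
```

Each nested loop plays the role of one UNNEST, and the length of the stroke array becomes the score, just as in the row shown above for player 37189.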
Once we have the hole-by-hole score, it’s easy to write the final query to aggregate by the player and by day.
SELECT d.name,
       SUM(CASE WHEN d.day = 1 THEN d.score ELSE 0 END) R1,
       SUM(CASE WHEN d.day = 2 THEN d.score ELSE 0 END) R2,
       SUM(CASE WHEN d.day = 3 THEN d.score ELSE 0 END) R3,
       SUM(CASE WHEN d.day = 4 THEN d.score ELSE 0 END) R4,
       SUM(d.score) T
FROM d
GROUP BY d.name
ORDER BY d.name
Note: The WITH clause is the common table expression (CTE) feature in the upcoming Mad-Hatter release. In Couchbase 5.5 or earlier, the old way to do this is with the LET clause. Post a question in the Couchbase forum if you need help with that.
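The SUM(CASE WHEN ...) pattern is a pivot: each hole-level row adds its score to the column for its day and to the total. A small plain-Python sketch of the same aggregation, with a hypothetical helper name and made-up rows:

```python
from collections import defaultdict


def round_totals(rows):
    """Same idea as SUM(CASE WHEN day = n THEN score ELSE 0 END) ... GROUP BY name."""
    totals = defaultdict(lambda: {"R1": 0, "R2": 0, "R3": 0, "R4": 0, "T": 0})
    for row in rows:
        t = totals[row["name"]]
        t["R%d" % row["day"]] += row["score"]  # only the matching round gets the score
        t["T"] += row["score"]                 # SUM(d.score)
    # ORDER BY name
    return [{"name": name, **totals[name]} for name in sorted(totals)]


# Illustrative hole-level rows, not real tournament data.
rows = [
    {"name": "B", "day": 1, "score": 4},
    {"name": "A", "day": 1, "score": 5},
    {"name": "A", "day": 2, "score": 3},
    {"name": "A", "day": 1, "score": 4},
]
report = round_totals(rows)
print(report)
```

A round the player never played stays at 0, which is what the cut detection in the next task relies on.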
Task 2: Now, create the full leaderboard and add the CUT information. The golfers who got cut don't play the third or fourth round, so their totals for those rounds are 0. We use this to determine which players got cut.
Query 2. Take the previous query, name it as the common table expression dx, and then add the following expression to determine the cut.
(CASE
   WHEN (d2.R1 = 0 OR d2.R2 = 0 OR d2.R3 = 0 OR d2.R4 = 0) THEN "CUT"
   ELSE MISSING
 END) AS CUT
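A plain-Python sketch of this expression, using made-up round totals and a hypothetical helper add_cut. The ELSE MISSING branch means the CUT field is simply left out of the result for players who finished, and because MISSING sorts before any string in N1QL's ascending order, ORDER BY CUT ASC lists the finishers first.

```python
def add_cut(row):
    """Add CUT="CUT" when any round total is 0; otherwise leave the field out."""
    out = dict(row)
    if 0 in (row["R1"], row["R2"], row["R3"], row["R4"]):
        out["CUT"] = "CUT"
    return out


# Illustrative totals only.
rows = [
    {"name": "A", "R1": 75, "R2": 78, "R3": 0, "R4": 0, "T": 153},
    {"name": "B", "R1": 70, "R2": 71, "R3": 72, "R4": 69, "T": 282},
    {"name": "C", "R1": 69, "R2": 70, "R3": 72, "R4": 70, "T": 281},
]

# ORDER BY CUT ASC, T ASC: rows without CUT first, then by total.
board = sorted((add_cut(r) for r in rows), key=lambda r: ("CUT" in r, r["T"]))
print([r["name"] for r in board])
```

Without the CUT ordering, player A's two-round total of 153 would put a cut player at the top of the board.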
Here’s the full query:
WITH dy AS (
  SELECT pl.hnum AS holedn,
         pl.ps.Nat AS country,
         (pl.ps.FN || " " || pl.ps.LN) AS name,
         pl.ps.ID AS ID,
         ARRAY_LENGTH(hps.Sks) AS score,
         hpl.hole AS `hole`,
         hpl.day AS `day`
  FROM (
    SELECT meta(usopen).id AS hnum, ps
    FROM usopen USE KEYS "holes:1:1"
    UNNEST Ps AS ps
  ) pl
  INNER JOIN (
    SELECT TONUMBER(SPLIT(meta(usopen).id, ":")[1]) AS `hole`,
           TONUMBER(SPLIT(meta(usopen).id, ":")[2]) AS `day`,
           hps
    FROM usopen
    UNNEST Rs AS rs
    UNNEST rs.Hs AS hs
    UNNEST hs.HPs AS hps
  ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT d.name,
         SUM(CASE WHEN d.day = 1 THEN d.score ELSE 0 END) R1,
         SUM(CASE WHEN d.day = 2 THEN d.score ELSE 0 END) R2,
         SUM(CASE WHEN d.day = 3 THEN d.score ELSE 0 END) R3,
         SUM(CASE WHEN d.day = 4 THEN d.score ELSE 0 END) R4,
         SUM(d.score) T
  FROM dy AS d
  GROUP BY d.name
  ORDER BY d.name
)
SELECT d2.name, d2.R1, d2.R2, d2.R3, d2.R4, d2.T,
       (CASE
          WHEN (d2.R1 = 0 OR d2.R2 = 0 OR d2.R3 = 0 OR d2.R4 = 0) THEN "CUT"
          ELSE MISSING
        END) AS CUT
FROM dx AS d2
ORDER BY CUT ASC, d2.T ASC
Task 3: Determine the winners.
We need to rank the players by total score to determine who won the tournament. When scores are tied, the tied players share a rank and the ranks that follow are skipped. Doing this in SQL without window functions is expensive. Here, we write the query using the RANK() window function. Window functions are a feature of N1QL in the upcoming release (Mad-Hatter).
Query 3:
WITH dy AS (
  SELECT pl.hnum AS holedn,
         pl.ps.Nat AS country,
         (pl.ps.FN || " " || pl.ps.LN) AS name,
         pl.ps.ID AS ID,
         ARRAY_LENGTH(hps.Sks) AS score,
         hpl.hole AS `hole`,
         hpl.day AS `day`
  FROM (
    SELECT meta(usopen).id AS hnum, ps
    FROM usopen USE KEYS "holes:1:1"
    UNNEST Ps AS ps
  ) pl
  INNER JOIN (
    SELECT TONUMBER(SPLIT(meta(usopen).id, ":")[1]) AS `hole`,
           TONUMBER(SPLIT(meta(usopen).id, ":")[2]) AS `day`,
           hps
    FROM usopen
    UNNEST Rs AS rs
    UNNEST rs.Hs AS hs
    UNNEST hs.HPs AS hps
  ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT d.name,
         SUM(CASE WHEN d.day = 1 THEN d.score ELSE 0 END) R1,
         SUM(CASE WHEN d.day = 2 THEN d.score ELSE 0 END) R2,
         SUM(CASE WHEN d.day = 3 THEN d.score ELSE 0 END) R3,
         SUM(CASE WHEN d.day = 4 THEN d.score ELSE 0 END) R4,
         SUM(d.score) T
  FROM dy AS d
  GROUP BY d.name
  ORDER BY d.name
)
SELECT d2.name, d2.R1, d2.R2, d2.R3, d2.R4, d2.T,
       RANK() OVER (ORDER BY d2.T + CUT) AS `rank`
FROM dx AS d2
LET CUT = (CASE
             WHEN (d2.R1 = 0 OR d2.R2 = 0 OR d2.R3 = 0 OR d2.R4 = 0) THEN 1000
             ELSE 0
           END)
ORDER BY `rank`
Notice that ranks 4, 8, 9, 10, and 11 are missing because of the tied scores!
Task 4: Now, let's find out how each player fared after round 1, round 2, and round 3 compared to the final round. With window functions, it becomes as easy as making chocolate-covered marshmallows disappear.
Query 4: Use the same RANK() function, but ORDER BY the cumulative score after each day (day1, day1+day2, day1+day2+day3) instead of just the final score.
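The skipped ranks follow directly from how RANK() is defined: a row's rank is one plus the number of rows that sort strictly ahead of it. A minimal plain-Python sketch with made-up totals (the function name rank is ours, not a Couchbase API):

```python
def rank(values):
    """SQL-style RANK(): ties share a rank and the following ranks are skipped."""
    first_pos = {}
    # The first 1-based position of each value in sorted order is its rank.
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)
    return [first_pos[v] for v in values]


# Illustrative four-round totals for players who finished.
finished = [281, 282, 283, 283, 285]
print(rank(finished))  # [1, 2, 3, 3, 5]

# A cut player has a low two-round total; the +1000 CUT penalty from the
# query pushes them below everyone who finished.
cut_player = 153 + 1000
print(rank(finished + [cut_player]))  # [1, 2, 3, 3, 5, 6]
```

Two players tied at 283 both get rank 3, so rank 4 never appears, which is the same effect seen in the leaderboard.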
WITH dy AS (
  SELECT pl.hnum AS holedn,
         pl.ps.Nat AS country,
         (pl.ps.FN || " " || pl.ps.LN) AS name,
         pl.ps.ID AS ID,
         ARRAY_LENGTH(hps.Sks) AS score,
         hpl.hole AS `hole`,
         hpl.day AS `day`
  FROM (
    SELECT meta(usopen).id AS hnum, ps
    FROM usopen USE KEYS "holes:1:1"
    UNNEST Ps AS ps
  ) pl
  INNER JOIN (
    SELECT TONUMBER(SPLIT(meta(usopen).id, ":")[1]) AS `hole`,
           TONUMBER(SPLIT(meta(usopen).id, ":")[2]) AS `day`,
           hps
    FROM usopen
    UNNEST Rs AS rs
    UNNEST rs.Hs AS hs
    UNNEST hs.HPs AS hps
  ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT d.name,
         SUM(CASE WHEN d.day = 1 THEN d.score ELSE 0 END) R1,
         SUM(CASE WHEN d.day = 2 THEN d.score ELSE 0 END) R2,
         SUM(CASE WHEN d.day = 3 THEN d.score ELSE 0 END) R3,
         SUM(CASE WHEN d.day = 4 THEN d.score ELSE 0 END) R4,
         SUM(d.score) T
  FROM dy AS d
  GROUP BY d.name
  ORDER BY d.name
)
SELECT d2.name, d2.R1, d2.R2, d2.R3, d2.R4, d2.T,
       DENSE_RANK() OVER (ORDER BY d2.T + CUT) AS rankMoney,
       RANK() OVER (ORDER BY d2.T + CUT) AS rankFinal,
       RANK() OVER (ORDER BY d2.R1) AS round1rank,
       RANK() OVER (ORDER BY d2.R1 + d2.R2) AS round2rank,
       RANK() OVER (ORDER BY d2.R1 + d2.R2 + d2.R3 + CUT) AS round3rank
FROM dx AS d2
LET CUT = (CASE
             WHEN (d2.R1 = 0 OR d2.R2 = 0 OR d2.R3 = 0 OR d2.R4 = 0) THEN 1000
             ELSE 0
           END)
ORDER BY rankFinal, round1rank, round2rank, round3rank
Now you can see how the players moved up or down each day.
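For readers who want to see the mechanics, here is a small plain-Python sketch of ranking on cumulative scores, and of how DENSE_RANK() (used for rankMoney) differs from RANK(). The three players and their scores are invented, and the helper names are ours.

```python
def rank(values):
    """RANK(): ties share a rank, later ranks are skipped."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def dense_rank(values):
    """DENSE_RANK(): ties share a rank, no ranks are skipped."""
    position = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [position[v] for v in values]


# Illustrative round scores only.
players = [
    {"name": "A", "R1": 70, "R2": 72, "R3": 71},
    {"name": "B", "R1": 71, "R2": 70, "R3": 75},
    {"name": "C", "R1": 70, "R2": 71, "R3": 72},
]

round1 = rank([p["R1"] for p in players])
round2 = rank([p["R1"] + p["R2"] for p in players])
after3 = [p["R1"] + p["R2"] + p["R3"] for p in players]
round3 = rank(after3)
dense3 = dense_rank(after3)

for p, r1, r2, r3 in zip(players, round1, round2, round3):
    print(p["name"], r1, r2, r3)
```

In this toy data, player A slips from a tie for first to third after round 2 and climbs back after round 3, which is the kind of movement the query exposes for the real field.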
Task 5: Create the full scorecard for the leader using the basic shot-by-shot statistics.
Query 5: Brooks Koepka won the 2018 US Open. Let's get his scores hole by hole, along with his cumulative scores by round. Notice how the simple SUM() and COUNT() aggregates work as window functions with the OVER() clause.
SUM(d2.score) OVER (PARTITION BY d2.day ORDER BY d2.hole) hst
This partitions the scores by day, as specified by the PARTITION BY clause, and orders the rows within each day by hole, 1 through 18. The SUM then adds up the scores so far in that round.
SUM(d3.score) OVER (ORDER BY d3.day, d3.hole) ToTScore
This SUM() function simply adds up the scores from day 1, hole 1 to day 4, hole 18, as specified by the ORDER BY d3.day, d3.hole within the OVER() clause. The field ToTScore shows Koepka's total shots for the tournament at each hole.
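Both running totals can be sketched in plain Python with itertools. The rows below are invented (two holes per day instead of eighteen), and the variable names mirror the aliases in the query: hst and hpr restart with each day, while ToTScore and HoleNum run across the whole tournament.

```python
from itertools import accumulate, groupby

# Hypothetical (day, hole, score, par) rows, already sorted by day and hole.
rows = [(1, 1, 5, 4), (1, 2, 4, 4), (2, 1, 4, 4), (2, 2, 3, 4)]

scorecard = []
for day, group in groupby(rows, key=lambda r: r[0]):  # PARTITION BY day
    group = list(group)                               # rows are in hole order
    hst = accumulate(r[2] for r in group)             # running strokes in the round
    hpr = accumulate(r[3] for r in group)             # running par in the round
    for (d, hole, score, par), s, p in zip(group, hst, hpr):
        scorecard.append({"day": d, "hole": hole, "hst": s, "ToPar": s - p})

# No PARTITION BY: the running values span all days.
tot_scores = list(accumulate(r[2] for r in rows))     # SUM(score) OVER (ORDER BY day, hole)
hole_nums = list(range(1, len(rows) + 1))             # COUNT(1) OVER (ORDER BY day, hole)

print(scorecard)
print(tot_scores, hole_nums)
```

The running behavior comes from the ORDER BY inside OVER(): each row's aggregate covers the rows from the start of its partition up to the current row.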
WITH dy AS (
  SELECT pl.hnum AS holedn,
         pl.ps.Nat AS country,
         (pl.ps.FN || " " || pl.ps.LN) AS name,
         pl.ps.ID AS ID,
         ARRAY_LENGTH(hps.Sks) AS score,
         hpl.hole AS `hole`,
         hpl.day AS `day`,
         hpl.Par AS Par
  FROM (
    SELECT meta(usopen).id AS hnum, ps
    FROM usopen USE KEYS "holes:1:1"
    UNNEST Ps AS ps
    WHERE ps.LN = "Koepka"
  ) pl
  INNER JOIN (
    SELECT TONUMBER(SPLIT(meta(usopen).id, ":")[1]) AS `hole`,
           TONUMBER(SPLIT(meta(usopen).id, ":")[2]) AS `day`,
           hs.Par,
           hps
    FROM usopen
    UNNEST Rs AS rs
    UNNEST rs.Hs AS hs
    UNNEST hs.HPs AS hps
  ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT d.name, d.day, d.score, d.hole, d.Par
  FROM dy AS d
  ORDER BY d.name
),
dz AS (
  SELECT d2.day, d2.hole, d2.score,
         SUM(d2.score) OVER (PARTITION BY d2.day ORDER BY d2.hole) hst,
         d2.Par,
         SUM(d2.Par) OVER (PARTITION BY d2.day ORDER BY d2.hole) hpr
  FROM dx AS d2
  LET CUT = (CASE
               WHEN (d2.R1 = 0 OR d2.R2 = 0 OR d2.R3 = 0 OR d2.R4 = 0) THEN 1000
               ELSE 0
             END)
  ORDER BY d2.day, d2.hole
)
SELECT d3.Par, d3.day, d3.hole, d3.hst, d3.score,
       (d3.hst - d3.hpr) ToPar,
       SUM(d3.score) OVER (ORDER BY d3.day, d3.hole) ToTScore,
       COUNT(1) OVER (ORDER BY d3.day, d3.hole) HoleNum
FROM dz AS d3