Optimized my Analytics N1QL query

lyndondonz · September 2, 2020, 1:45am

Hi Couchbase Gurus,

Need your help badly, for optimizing my Analytics Query.

Information:
So, I have these number of documents in my bucket, which I use in my query:
product-aggregate: 1959780
product: 651
// unique product documents.

[
   {
     Id: product::0010,
     name: ProductName01 
     productCode: 0010,
     type: product 
   },
...
]

// product-aggregate document

[
   {
     product: product::0010,
     qtyShipped: 5,
     sales: 65.89,
     type: ‘product-aggregate’ 
   },
   {
     product: product::0010,
     qtyShipped: 15,
     sales: 75.50,
     type: ‘product-aggregate’ 
   },
   {
     product: product::0011,
     qtyShipped: 11,
     sales: 12.45,
     type: ‘product-aggregate’ 
   },
...
]

Query Execution Time:

The Query:

SELECT
  META(p).id,
  p.name, p.productCode, p.rank, p.unit, p.price,
  totals.totalSales AS totalSales,
  totals.totalShipped AS totalShipped
FROM `company_analytics`.`products` AS p
LET totals = (
  SELECT SUM(pa.sales) AS totalSales, SUM(pa.qtyShipped) AS totalShipped
  FROM `company_analytics`.`product-aggregates` AS pa
  WHERE META(p).id = pa.product
  AND pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z'
  AND pa.entityId IN [ 
    "territory::0000",
    "territory::0001",
    "territory::0002",
    "territory::0003",
    "territory::0004",
    "territory::0005",
   … 40+ more
  ]  
)[0]
ORDER BY totalShipped DESC
LIMIT 50

dmitry.lychagin · September 2, 2020, 4:50am

Hi @lyndondonz,

Try the following query instead:

SELECT
  META(p).id,
  p.name, p.productCode, p.rank, p.unit, p.price,
  totals.*
FROM `company_analytics`.`products` AS p 
LEFT OUTER JOIN  (
  SELECT pa.product, SUM(pa.sales) AS totalSales, SUM(pa.qtyShipped) AS totalShipped
  FROM `company_analytics`.`product-aggregates` AS pa
  WHERE (pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z')
  AND pa.entityId IN [ 
    "territory::0000",
    "territory::0001",
    "territory::0002",
    "territory::0003",
    "territory::0004",
    "territory::0005"
     ...
  ]  
  GROUP BY pa.product
) AS totals ON META(p).id = totals.product
LIMIT 50

First run it without any indexes. Let’s see how long it takes.
Then try creating the index on periodAt:

create index idx_periodAt on `company_analytics`.`product-aggregates`(periodAt: string);

Run the query again.
Alternatively you could try the index on entityId:

drop index `company_analytics`.`product-aggregates`.idx_periodAt;
create index idx_entityId on `company_analytics`.`product-aggregates`(entityId: string);

It’d be helpful to know how many documents match each predicate:

periodAt BETWEEN ‘2020-01-01T06:00:00.000Z’ AND ‘2020-08-28T23:25:53.178Z’
entityId IN [ … 40+… ]

lyndondonz · September 2, 2020, 6:38am

Hi @dmitry.lychagin,

Sorry for the late response. So, I’v you use and run you query.

Without Indexes:

With periodAt Index Only:

With entityId Index Only:
QUERY_EXECUTION_WITH_ENTITYID_INDEX_ONLY

**With both entityId and periodAt Index **:

periodAt BETWEEN ‘2020-01-01T06:00:00.000Z’ AND ‘2020-08-28T23:25:53.178Z’
entityId IN [ … 40+… ]

Both of them have the same results because they are JOIN with products with 651 unique documents.

dmitry.lychagin · September 2, 2020, 8:09pm

Hi @lyndondonz,

I just want to clarify that for the last two I meant the following queries:

SELECT COUNT(*) 
FROM `company_analytics`.`product-aggregates` AS pa
WHERE pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z'

and

SELECT COUNT(*) 
FROM `company_analytics`.`product-aggregates` AS pa
WHERE pa.entityId IN [ ... 40+ ... ]

I’d run each without any indexes.

Overall, it looks like the first query (full scan) is the fastest. Is adding hardware an option? It would help with improving its performance.

lyndondonz · September 2, 2020, 10:17pm

Hi @dmitry.lychagin

thank you for your response appreciated, Here’s the doc count you ask:

SELECT COUNT(*) 
FROM `bluebonnet_analytics`.`product-aggregates` AS pa
WHERE pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z'

Doc Count: 900110

SELECT COUNT(*) 
FROM `company_analytics`.`product-aggregates` AS pa
WHERE pa.entityId IN [ ... 40+ ... ]

Doc Count: 487266

dmitry.lychagin · September 2, 2020, 10:57pm

With these numbers the full scan query (the first query) is faster.
If there’s an index on periodAt then we need to fetch ~45% of the documents using it ( 900110/ 1959780). If there’s and index on entityId then we fetch ~ 24% of the documents (487266/1959780). In both cases the full scan query would have a better performance.

lyndondonz · September 3, 2020, 1:05am

Hi @dmitry.lychagin,

Sorry I was not able to absorb all, can you elaborate more on the approach please.

Thank you.

dmitry.lychagin · September 3, 2020, 5:02pm

Hi @lyndondonz,

I was trying to say that having indexes is not always beneficial for query performance. It looks like in your case the query that I sent earlier needs to run without indexes to get the best performance.

Suresh_Kumar · September 10, 2020, 6:24am

Hi ,
How did you get all these metrics? Which version you were using? I’m getting only few metrics.

mhubail · September 10, 2020, 8:06pm

Hi @Suresh_Kumar,
The metrics you pointed to are the metrics available in the Query Service UI, whereas the ones @lyndondonz mentioned are available in the Analytics Service UI.

Topic		Replies	Views
Optimizing analytics query with optional filters Analytics index	5	1550	May 29, 2020
How to optimize query performance which aggregate millions of rows SQL++ data_modelling , n1ql , index	3	2520	April 24, 2017
Analytics query perfomance Couchbase Server query , n1ql	2	73	February 14, 2025
How to GROUP BY and Filter using the Array values in Analytics N1QL query Analytics query , n1ql , analytics	12	2086	October 1, 2020
Group By Type slow Analytics	3	428	November 15, 2023

Optimized my Analytics N1QL query

Related topics