Optimized my Analytics N1QL query

Hi Couchbase Gurus,

Need your help badly, for optimizing my Analytics Query.

Information:
So, I have these number of documents in my bucket, which I use in my query:
product-aggregate: 1959780
product: 651
// unique product documents.

[
   {
     Id: product::0010,
     name: ProductName01 
     productCode: 0010,
     type: product 
   },
...
]

// product-aggregate document

[
   {
     product: product::0010,
     qtyShipped: 5,
     sales: 65.89,
     type: ‘product-aggregate’ 
   },
   {
     product: product::0010,
     qtyShipped: 15,
     sales: 75.50,
     type: ‘product-aggregate’ 
   },
   {
     product: product::0011,
     qtyShipped: 11,
     sales: 12.45,
     type: ‘product-aggregate’ 
   },
...
]

Query Execution Time:
Query_Exec_Time

The Query:

SELECT
  META(p).id,
  p.name, p.productCode, p.rank, p.unit, p.price,
  totals.totalSales AS totalSales,
  totals.totalShipped AS totalShipped
FROM `company_analytics`.`products` AS p
LET totals = (
  SELECT SUM(pa.sales) AS totalSales, SUM(pa.qtyShipped) AS totalShipped
  FROM `company_analytics`.`product-aggregates` AS pa
  WHERE META(p).id = pa.product
  AND pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z'
  AND pa.entityId IN [ 
    "territory::0000",
    "territory::0001",
    "territory::0002",
    "territory::0003",
    "territory::0004",
    "territory::0005",
   … 40+ more
  ]  
)[0]
ORDER BY totalShipped DESC
LIMIT 50

Hi @lyndondonz,

Try the following query instead:

SELECT
  META(p).id,
  p.name, p.productCode, p.rank, p.unit, p.price,
  totals.*
FROM `company_analytics`.`products` AS p 
LEFT OUTER JOIN  (
  SELECT pa.product, SUM(pa.sales) AS totalSales, SUM(pa.qtyShipped) AS totalShipped
  FROM `company_analytics`.`product-aggregates` AS pa
  WHERE (pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z')
  AND pa.entityId IN [ 
    "territory::0000",
    "territory::0001",
    "territory::0002",
    "territory::0003",
    "territory::0004",
    "territory::0005"
     ...
  ]  
  GROUP BY pa.product
) AS totals ON META(p).id = totals.product
LIMIT 50

First run it without any indexes. Let’s see how long it takes.
Then try creating the index on periodAt:

create index idx_periodAt on `company_analytics`.`product-aggregates`(periodAt: string);

Run the query again.
Alternatively you could try the index on entityId:

drop index `company_analytics`.`product-aggregates`.idx_periodAt;
create index idx_entityId on `company_analytics`.`product-aggregates`(entityId: string);

It’d be helpful to know how many documents match each predicate:

  1. periodAt BETWEEN ‘2020-01-01T06:00:00.000Z’ AND ‘2020-08-28T23:25:53.178Z’
  2. entityId IN [ … 40+… ]

Hi @dmitry.lychagin,

Sorry for the late response. So, I’v you use and run you query.

Without Indexes:
QUERY_EXECUTION_WITHOUT_INDEX

With periodAt Index Only:
QUERY_EXECUTION_WITH_PERIODAT_INDEX_ONLY

With entityId Index Only:
QUERY_EXECUTION_WITH_ENTITYID_INDEX_ONLY

**With both entityId and periodAt Index **:
QUERY_EXECUTION_WITH_BOTH

  1. periodAt BETWEEN ‘2020-01-01T06:00:00.000Z’ AND ‘2020-08-28T23:25:53.178Z’
    651_DOCS_RETURN_ON_PERIOD_AT
  2. entityId IN [ … 40+… ]
    651_WITH_ENTITY_ID
    Both of them have the same results because they are JOIN with products with 651 unique documents.

Hi @lyndondonz,

I just want to clarify that for the last two I meant the following queries:

SELECT COUNT(*) 
FROM `company_analytics`.`product-aggregates` AS pa
WHERE pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z'  

and

SELECT COUNT(*) 
FROM `company_analytics`.`product-aggregates` AS pa
WHERE pa.entityId IN [ ... 40+ ... ]

I’d run each without any indexes.

Overall, it looks like the first query (full scan) is the fastest. Is adding hardware an option? It would help with improving its performance.

Hi @dmitry.lychagin

thank you for your response appreciated, Here’s the doc count you ask:

SELECT COUNT(*) 
FROM `bluebonnet_analytics`.`product-aggregates` AS pa
WHERE pa.periodAt BETWEEN '2020-01-01T06:00:00.000Z' AND '2020-08-28T23:25:53.178Z'

Doc Count: 900110

SELECT COUNT(*) 
FROM `company_analytics`.`product-aggregates` AS pa
WHERE pa.entityId IN [ ... 40+ ... ]

Doc Count: 487266

With these numbers the full scan query (the first query) is faster.
If there’s an index on periodAt then we need to fetch ~45% of the documents using it ( 900110/ 1959780). If there’s and index on entityId then we fetch ~ 24% of the documents (487266/1959780). In both cases the full scan query would have a better performance.

Hi @dmitry.lychagin,

Sorry I was not able to absorb all, can you elaborate more on the approach please.

Thank you.

Hi @lyndondonz,

I was trying to say that having indexes is not always beneficial for query performance. It looks like in your case the query that I sent earlier needs to run without indexes to get the best performance.

Hi ,
How did you get all these metrics? Which version you were using? I’m getting only few metrics.
image

Hi @Suresh_Kumar,
The metrics you pointed to are the metrics available in the Query Service UI, whereas the ones @lyndondonz mentioned are available in the Analytics Service UI.