Help with Analytics Query - Optimize and Index

Hi,

I have the below query, which gives me the results I need. The problem is that the query runs for 6+ secs. My intention is to try and re-write the query to make it more efficient and use indexes where possible.

 WITH available_cb_docs AS ( WITH available_daily_docs AS (
        SELECT RAW td.DocNum
        FROM tranData td
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
        UNION ALL
        SELECT RAW mch.DocNum
        FROM tranData td
        UNNEST td.ActvCustOrdr AS mch
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    )
    SELECT DISTINCT raw DocNum
    FROM tranData
    WHERE DocNum IN (available_daily_docs)and `$MdfdTmstmp` between "{previous_date}" and "{report_end_date}") ,
    available_daily_report_docs AS (
    SELECT RAW td.DocNum
    FROM tranData td
    WHERE td.`$MdfdTmstmp` BETWEEN  "{report_date}" AND "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    UNION ALL
    SELECT RAW mch.DocNum
    FROM tranData td
    UNNEST td.ActvCustOrdr AS mch
    WHERE td.`$MdfdTmstmp` BETWEEN  "{report_date}" AND "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    )
    SELECT array_symdiff(available_daily_report_docs, available_cb_docs) AS cust_order_missing_docs;

One of the things I want to do is replace the Union query with a Variable so it can be executed only one. The problem I run into is with the DocNum IN (available_daily_docs) part. For whatever reason, whenever there is an IN clause, Analytics does not want to pick the index defined on DocNum. If I do DocNum = "123", the index is used and response is in milliseconds. But if I do DocNum in ["123", "456"], it takes 6 secs and does not use the index.

Any suggestion on optimizing this?

I am not analytic expert. Checkout Analytics Index with IN - #2 by dmitry.lychagin

Can you rewrite as follows without repeatedly scan of the data?

WITH available_report_docs AS (
        SELECT td.DocNum, td.`$MdfdTmstmp` AS timestamp
        FROM tranData td
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
        UNION ALL
        SELECT mch.DocNum, td.`$MdfdTmstmp` AS timestamp
        FROM tranData td
        UNNEST td.ActvCustOrdr AS mch
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    ),
    available_cb_docs AS ( SELECT DISTINCT RAW c.DocNum
                           FROM available_report_docs AS c
                           WHERE c.timestamp BETWEEN "{previous_date}" and "{report_end_date}"),
    available_daily_report_docs AS ( SELECT DISTINCT RAW c.DocNum FROM available_report_docs AS c)
    SELECT ARRAY_SYMDIFF(available_daily_report_docs, available_cb_docs) AS cust_order_missing_docs;

adjust based on which date is earliest {previous_date} or “{report_date}”
If Analytic supports EXCEPT see if works for you

...............
SELECT RAW c1.DocNum FROM available_report_docs AS c1
 EXCEPT
 SELECT RAW c.DocNum FROM available_report_docs AS c WHERE c.timestamp BETWEEN "{previous_date}" and "{report_end_date}"

Hello Amrish:

Can you please let us know what version you are using.

Also, I tried a simple experiment in house.

select count () from dataset where col in [1];
and
select count (
) from dataset where col in [1, 2, 3];

In both cases an index plan was picked.

This is how I created the index.

create index index1 on dataset (col:int);

Please forward your create index statement.

Murali.

Hi Murali,

We are on CB 6.6.4 EE.

Index is defined as

missing_doc_ix (RteId:string)
($Type:string)
($MdfdTmstmp:string)

Below query picks up index as expected

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId in ["91940"]
limit 5;

Note the IN cause for RteId but with only 1 item in the IN Array. Returned the result in ~50ms

 "expressions": ["index-search(\"missing_doc_ix\", 0, \"Default\", \"tranData\", FALSE, FALSE, 2, $$31, $$32, 2, $$33, $$34, TRUE, TRUE, TRUE)"
                                                                          ],

Now, if I run the same query again with multiple items in the IN clause. I just copied the same route again in the IN

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId in ["91940","91940"]
limit 5;

Query ran for 10.15s and still got the 2 docs back. No index used. Happy to log a support ticket if that will help with the investigation. Bottom line for us is that I have never ever been able to get Analytics indexes to work with an IN clause.

Please create an additional index on only that one field – RteId.

@amrish The issue you encountered has been fixed in CB 7.1.0.

Starting 7.1.0, prefix search on secondary indexes has been enabled. This means that when you have the below index:

CREATE INDEX missing_doc_ix ON tranData(RteId:string, $Type:string, $MdfdTmstmp:string)

The missing_doc_ix index can be used to answer queries that have a prefix of the index in their predicate. For example,

SELECT `$MdfdTmstmp` FROM tranData WHERE RteId = "91940";

Before 7.1.0, in order for a secondary index to be picked up, all the fields need to be in the predicate. For example,

SELECT `$MdfdTmstmp`
FROM tranData
WHERE `$Type` = "DailyReport" AND `$MdfdTmstmp` > "2021-06-07" AND RteId = "91940";

Now for your issue with the below query:

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId IN ["91940","91940"]
limit 5;

Even though your query has all the fields in the predicate, the compiler still tries to do a prefix search using only the RteId on the secondary index due to how the IN clause is handled, but the compiler cannot use the secondary index because prefix search is disabled before 7.1.0.

As to why the below query picks up the secondary index:

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId in ["91940"]
limit 5;

That is because it is equivalent to the below query where all the fields are present in the predicate:

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId = "91940"
limit 5;
1 Like

Thanks for letting me know. I’ll take 7.1.1 for a spin. We do intend to upgrade to 7 but I guess we’ll have to do it sooner than later if this issue is resolved in 7.1.x. Thanks for looking into this.