Improving query with multiple OR clauses

krishan.jangid · February 6, 2018, 4:08pm

Hi, following query is used for a search feature on UI which works as follows:

Exact match with entered string on cardNumber
leading characters match on firstName and lastName

Query:

SELECT count(*) FROM reporting 
WHERE type = "Person" 
AND
 (LOWER(cardNumber) = LOWER("a")
OR (LOWER(firstName) like "a%" 
OR LOWER(lastName) like "a%"))
;

This query is taking approx. 5 to 6 seconds with 20k documents with ‘type’ index. When I created two separate indexes (one on CardNumber and the other for firstName and lastName) the execution time increased to 7-8 seconds.
Can you suggest any index which can make it faster?

vsr1 · February 6, 2018, 6:09pm

If cardNumber is string of 1 character and you are looking for same first character. Try this

CREATE INDEX ix1 ON reporting 
(DISTINCT ARRAY v FOR v IN [LOWER(cardNumber),LOWER(firstName),LOWER(lastName)] END) 
WHERE type = "Person";

SELECT COUNT(1) FROM reporting
WHERE   type = "Person" AND
ANY v IN  [LOWER(cardNumber),LOWER(firstName),LOWER(lastName)]  SATISFIES v LIKE "a%" END;

If you want to make covered index add cardNumber,firstName,lastName as index keys at the end

krishan.jangid · February 7, 2018, 6:38am

@vsr1, cardNumber is an alphanumeric string of variable length. Whatever the user enters in the search box, will be matched against cardNumber(equals operatior) and Name(like operator).

So the query that I am looking for should be like-

SELECT COUNT(1) FROM 
reporting 
WHERE rgm.type = "Person" 
AND (LOWER(cardNumber) = LOWER("a")
OR ANY v IN [LOWER(firstName),LOWER(lastName)] SATISFIES v LIKE "a%" END
);

Which is still taking around 4 seconds with index:

CREATE INDEX ix1 ON reporting(lower(cardNumber),distinct (array v for v in [lower(firstName), lower(lastName)] end)) WHERE (type = "Person")

vsr1 · February 7, 2018, 2:28pm

OR is tricky because it might be missing .

CREATE INDEX ix1 ON reporting(lower(cardNumber)) WHERE (type = "Person");
CREATE INDEX ix2 ON reporting(distinct array v for v in [lower(firstName), lower(lastName)] end,firstName, lastName ) WHERE (type = "Person");

SELECT COUNT(1) FROM 
reporting 
WHERE rgm.type = "Person" 
AND (LOWER(cardNumber) = LOWER("a")
OR ANY v IN [LOWER(firstName),LOWER(lastName)] SATISFIES v LIKE "a%" END
);

OR

SELECT ARRAY_COUNT(av) LET av = (SELECT META().id FROM reporting 
WHERE rgm.type = "Person" 
AND LOWER(cardNumber) = LOWER("a") UNION SELECT META().id FROM reporting 
WHERE rgm.type = "Person"  ANY v IN [LOWER(firstName),LOWER(lastName)] SATISFIES v LIKE "a%" END);

Do EXPLAIN on subquery and see if it is covered if not make it cover.

krishan.jangid · February 8, 2018, 6:11pm

Hi @vsr1, I eventually had to go with the first query that you suggested. I am using two queries here. One for Count, and the other for selecting specific list of attributes.
When I created a covering index for the select query, (which automatically covers count query as well) and executed it on two different machines with same Hardware configuration (16GB RAM and 4 Core CPU) but different versions of couchbase server, I found a huge difference in the query performance.

Machine 1 is using CB 5.0.1 and Machine-2 is using CB Analytics DP4. Machine 1 is executing this query in 10 seconds, and machine 2 is executing it within 1.5 seconds. the reporting bucket on both machine is allocated with 2 GB of RAM and and index service has 512 MB RAM allocated.

Also, the slower machine has 5 more buckets other than reporting (with 1 GB RAM each) and the machine which is showing better performance has only one bucket on which I am working.
I compared the explain plan for both machines as well and didn’t find any difference in it.

Index:

CREATE INDEX IX_reporting_Person ON reporting((distinct (array n for n in [lower(reservationNumber), lower(firstName), lower(lastName)] end)),verificationStatus,lastPendingDocumentDate,startDate,isInternational,isOcrDirty,endDate,firstName,middleName,lastName,personId,securityPhotoMediaItemId,cardNumber,isReviewLater,verifiedBy) WHERE (type = "Person")

Data query:

Select case When r.verificationStatus = 'PENDING' and (date_diff_str(clock_local() , r.lastPendingDocumentDate, 'day') >= 4 OR date_diff_str(embarkDate, r.lastPendingDocumentDate, 'day') <= 5) Then 'OVERDUE' ELSE r.verificationStatus end as status, r.startDate,r.isInternational,r.isOcrDirty,r.endDate,r.firstName, r.middleName, r.lastName, r.personId, r.securityPhotoMediaItemId,r.cardNumber,r.isReviewLater, r.verifiedBy From reporting r where r.type = 'Person' AND meta(r).id NOT LIKE "_sync%" AND (ANY x IN [LOWER(r.cardNumber),LOWER(r.firstName), LOWER(r.lastName)] SATISFIES x like LOWER("b%") END) order by meta(r).id limit 20 offset 0;

Time taken on both machines:

On Machine-1 greater than 10 s (with approx. 0.2 M records in the bucket)
On Machine-2 within 1.5 s (with 0.3 M documents in the bucket)

Explain Plan:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan2",
              "covers": [
                "cover ((distinct (array `n` for `n` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] end)))",
                "cover ((`r`.`verificationStatus`))",
                "cover ((`r`.`lastPendingDocumentDate`))",
                "cover ((`r`.`startDate`))",
                "cover ((`r`.`isInternational`))",
                "cover ((`r`.`isOcrDirty`))",
                "cover ((`r`.`endDate`))",
                "cover ((`r`.`firstName`))",
                "cover ((`r`.`middleName`))",
                "cover ((`r`.`lastName`))",
                "cover ((`r`.`personId`))",
                "cover ((`r`.`securityPhotoMediaItemId`))",
                "cover ((`r`.`cardNumber`))",
                "cover ((`r`.`isReviewLater`))",
                "cover ((`r`.`verifiedBy`))",
                "cover ((meta(`r`).`id`))"
              ],
              "filter_covers": {
                "cover ((`r`.`type`))": "Person",
                "cover (any `x` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] satisfies ((\"b\" <= `x`) and (`x` < \"c\")) end)": true,
                "cover (any `x` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] satisfies (`x` like lower(\"b%\")) end)": true
              },
              "index": "IX_reporting_Person",
              "index_id": "2b071909296788d4",
              "index_projection": {
                "entry_keys": [
                  1,
                  3,
                  4,
                  5,
                  6,
                  8,
                  9,
                  10,
                  11,
                  12,
                  13,
                  14,
                  15,
                  16,
                  17
                ],
                "primary_key": true
              },
              "keyspace": "reporting",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "\"c\"",
                      "inclusion": 1,
                      "low": "\"b\""
                    }
                  ]
                }
              ],
              "using": "gsi"
            }
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Filter",
                  "condition": "(((cover ((`r`.`type`)) = \"Person\") and (not (cover ((meta(`r`).`id`)) like \"_sync%\"))) and cover (any `x` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] satisfies (`x` like lower(\"b%\")) end))"
                },
                {
                  "#operator": "InitialProject",
                  "result_terms": [
                    {
                      "as": "status",
                      "expr": "case when ((cover ((`r`.`verificationStatus`)) = \"PENDING\") and ((4 <= date_diff_str(clock_str(), cover ((`r`.`lastPendingDocumentDate`)), \"day\")) or (date_diff_str(cover ((`r`.`startDate`)), cover ((`r`.`lastPendingDocumentDate`)), \"day\") <= 5))) then \"OVERDUE\" else cover ((`r`.`verificationStatus`)) end"
                    },
                    {
                      "expr": "cover ((`r`.`startDate`))"
                    },
                    {
                      "expr": "cover ((`r`.`isInternational`))"
                    },
                    {
                      "expr": "cover ((`r`.`isOcrDirty`))"
                    },
                    {
                      "expr": "cover ((`r`.`endDate`))"
                    },
                    {
                      "expr": "cover ((`r`.`firstName`))"
                    },
                    {
                      "expr": "cover ((`r`.`middleName`))"
                    },
                    {
                      "expr": "cover ((`r`.`lastName`))"
                    },
                    {
                      "expr": "cover ((`r`.`personId`))"
                    },
                    {
                      "expr": "cover ((`r`.`securityPhotoMediaItemId`))"
                    },                    
                    {
                      "expr": "cover ((`r`.`cardNumber`))"
                    },
                    {
                      "expr": "cover ((`r`.`isReviewLater`))"
                    },
                    {
                      "expr": "cover ((`r`.`verifiedBy`))"
                    }
                  ]
                }
              ]
            }
          }
        ]
      },
      {
        "#operator": "Order",
        "limit": "20",
        "sort_terms": [
          {
            "expr": "cover ((meta(`r`).`id`))"
          }
        ]
      },
      {
        "#operator": "Limit",
        "expr": "20"
      },
      {
        "#operator": "FinalProject"
      }
    ]
  },
  "text": "Select case When r.verificationStatus = 'PENDING' and (date_diff_str(clock_local() , r.lastPendingDocumentDate, 'day') >= 4 OR date_diff_str(startDate , r.lastPendingDocumentDate, 'day') <= 5) Then 'OVERDUE' ELSE r.verificationStatus end as status,\nr.startDate,r.isInternational,r.isOcrDirty,r.endDate,r.firstName, r.middleName, r.lastName, r.personId, r.securityPhotoMediaItemId, r.cardNumber,r.isReviewLater, r.verifiedBy \nFrom reporting r\nwhere r.type = 'Person'\nAND meta(r).id NOT LIKE \"_sync%\" \nAND (ANY x IN [LOWER(r.cardNumber),LOWER(r.firstName), LOWER(r.lastName)] SATISFIES x like LOWER(\"b%\") END)\norder by meta(r).id limit 20 offset 0;"
}

Is DP4 really that faster than Non-Analytics version of couchbase Server? or it might be something else that I am missing. Please let me know.

vsr1 · February 8, 2018, 6:22pm

You are using aggregate query and Analytical is designed for that it does parallel. cc @till, @keshav_m

krishan.jangid · February 8, 2018, 6:25pm

Is It supposed to improve N1QL performance as well? I was thinking SQL++ is faster and N1QL will give same performance for both versions of couchbase.

krishan.jangid · February 8, 2018, 6:29pm

And if these two are doing things differently, then shouldn’t it be reflected in the explain plan as well? The exact same explain plan for both confused me a bit.

keshav_m · February 8, 2018, 6:31pm

If the two installations have exact same plan (and exact version of the Couchbase server), the only difference is hardware, type of the index, etc. Check for which type of index you have on both.

krishan.jangid · February 8, 2018, 6:35pm

@keshav_m , the machine 1 has Couchbase 5.0.1-5003-enterprise and Machine 2 has couchbase 5.0.0-837-enterprise (Analytics DP4). Both have Standard Global Secondary Indexes and hardware configuration for both machines is same.

vsr1 · February 8, 2018, 6:45pm

5.0.0-837-enterprise (Analytics DP4)
Is this Analytical service.

krishan.jangid · February 8, 2018, 6:59pm

I am using N1QL query service on this one as well. Not Analytics query service

vsr1 · February 8, 2018, 7:07pm

Enable profile (https://blog.couchbase.com/optimize-n1ql-performance-using-request-profiling/) and check which operator taking more time on both the machines

krishan.jangid · February 9, 2018, 6:43am

Machine 1:

   "phaseCounts": {
        "indexScan": 38492,
        "sort": 36853
    },
    "phaseOperators": {
        "authorize": 1,
        "indexScan": 1,
        "sort": 2
    },
    "phaseTimes": {
        "authorize": "37.051862ms",
        "indexScan": "8.407853868s",
        "instantiate": "69.229µs",
        "parse": "3.21491ms",
        "plan": "50.447944ms",
        "run": "9.795408322s",
        "sort": "233.726781ms"
    }

Machine 2:

  "phaseCounts": {
        "indexScan": 56191,
        "sort": 53728
    },
    "phaseOperators": {
        "authorize": 1,
        "indexScan": 1,
        "sort": 2
    },
   "phaseTimes": {
            "authorize": "631.49µs",
            "indexScan": "1.269954302s",
            "instantiate": "62.986µs",
            "parse": "3.239395ms",
            "plan": "1.373459ms",
            "run": "1.397644393s",
            "sort": "94.06877ms"
        }

krishan.jangid · February 9, 2018, 7:49am

My bad. Machine 1 (5.0.1) is using Legacy Global Secondary Index. (Didn’t even know this type of index existed!). How is this different than Standard Global Secondary Indexes?

vsr1 · February 9, 2018, 8:28am

Legacy Global Secondary Index is old forestdb, due to upgrade from old server
Standard Global Secondary Indexes is plasma

krishan.jangid · February 9, 2018, 8:30am

So I guess this shouldn’t be causing the issue as well.

vsr1 · February 9, 2018, 8:32am

That is the reason of performance differance, forestdb is very slow, plasma is much faster Using the New Plasma Storage Engine for Couchbase Indexes cc @deepkaran.salooja .

vsr1 · February 9, 2018, 8:43am

This is unrelated. Did you tried this index.
Also add escape charter for _ in like   meta().id NOT LIKE "\\_sync%" 

 

CREATE INDEX IX_reporting_Person ON reporting((distinct (array n for n in [lower(reservationNumber), lower(firstName), lower(lastName)] end)),verificationStatus,lastPendingDocumentDate,startDate,isInternational,isOcrDirty,endDate,firstName,middleName,lastName,personId,securityPhotoMediaItemId,cardNumber,isReviewLater,verifiedBy) WHERE (type = "Person") and meta().id NOT LIKE "_sync%"

krishan.jangid · February 9, 2018, 8:45am

the escape character should be in main query or in index query?

Topic		Replies	Views
N1QL slow compare to MySQL SQL++	21	2725	January 14, 2017
Simple user search takes 25+ seconds to execute on large cluster with fairly small data set SQL++	13	3280	June 1, 2016
N1QL query slow Couchbase Server n1ql , index	32	4724	November 7, 2016
N1QL Index with an OR statement SQL++	13	3274	April 9, 2016
Performance Issue N1QL Query on 200M major bucket volume SQL++	36	2180	January 4, 2020

Improving query with multiple OR clauses

Related topics