Improving query with multiple OR clauses

Hi, following query is used for a search feature on UI which works as follows:

  • Exact match with entered string on cardNumber
  • leading characters match on firstName and lastName

Query:

SELECT count(*) FROM reporting 
WHERE type = "Person" 
AND
 (LOWER(cardNumber) = LOWER("a")
OR (LOWER(firstName) like "a%" 
OR LOWER(lastName) like "a%"))
;

This query is taking approx. 5 to 6 seconds with 20k documents with ‘type’ index. When I created two separate indexes (one on CardNumber and the other for firstName and lastName) the execution time increased to 7-8 seconds.
Can you suggest any index which can make it faster?

If cardNumber is string of 1 character and you are looking for same first character. Try this

CREATE INDEX ix1 ON reporting 
(DISTINCT ARRAY v FOR v IN [LOWER(cardNumber),LOWER(firstName),LOWER(lastName)] END) 
WHERE type = "Person";

SELECT COUNT(1) FROM reporting
WHERE   type = "Person" AND
ANY v IN  [LOWER(cardNumber),LOWER(firstName),LOWER(lastName)]  SATISFIES v LIKE "a%" END;

If you want to make covered index add cardNumber,firstName,lastName as index keys at the end

@vsr1, cardNumber is an alphanumeric string of variable length. Whatever the user enters in the search box, will be matched against cardNumber(equals operatior) and Name(like operator).

So the query that I am looking for should be like-

SELECT COUNT(1) FROM 
reporting 
WHERE rgm.type = "Person" 
AND (LOWER(cardNumber) = LOWER("a")
OR ANY v IN [LOWER(firstName),LOWER(lastName)] SATISFIES v LIKE "a%" END
);

Which is still taking around 4 seconds with index:

CREATE INDEX ix1 ON reporting(lower(cardNumber),distinct (array v for v in [lower(firstName), lower(lastName)] end)) WHERE (type = "Person")

OR is tricky because it might be missing .

CREATE INDEX ix1 ON reporting(lower(cardNumber)) WHERE (type = "Person");
CREATE INDEX ix2 ON reporting(distinct array v for v in [lower(firstName), lower(lastName)] end,firstName, lastName ) WHERE (type = "Person");

SELECT COUNT(1) FROM 
reporting 
WHERE rgm.type = "Person" 
AND (LOWER(cardNumber) = LOWER("a")
OR ANY v IN [LOWER(firstName),LOWER(lastName)] SATISFIES v LIKE "a%" END
);

OR

SELECT ARRAY_COUNT(av) LET av = (SELECT META().id FROM reporting 
WHERE rgm.type = "Person" 
AND LOWER(cardNumber) = LOWER("a") UNION SELECT META().id FROM reporting 
WHERE rgm.type = "Person"  ANY v IN [LOWER(firstName),LOWER(lastName)] SATISFIES v LIKE "a%" END);

Do EXPLAIN on subquery and see if it is covered if not make it cover.

Hi @vsr1, I eventually had to go with the first query that you suggested. I am using two queries here. One for Count, and the other for selecting specific list of attributes.
When I created a covering index for the select query, (which automatically covers count query as well) and executed it on two different machines with same Hardware configuration (16GB RAM and 4 Core CPU) but different versions of couchbase server, I found a huge difference in the query performance.

Machine 1 is using CB 5.0.1 and Machine-2 is using CB Analytics DP4. Machine 1 is executing this query in 10 seconds, and machine 2 is executing it within 1.5 seconds. the reporting bucket on both machine is allocated with 2 GB of RAM and and index service has 512 MB RAM allocated.

Also, the slower machine has 5 more buckets other than reporting (with 1 GB RAM each) and the machine which is showing better performance has only one bucket on which I am working.
I compared the explain plan for both machines as well and didn’t find any difference in it.

Index:

CREATE INDEX IX_reporting_Person ON reporting((distinct (array n for n in [lower(reservationNumber), lower(firstName), lower(lastName)] end)),verificationStatus,lastPendingDocumentDate,startDate,isInternational,isOcrDirty,endDate,firstName,middleName,lastName,personId,securityPhotoMediaItemId,cardNumber,isReviewLater,verifiedBy) WHERE (type = "Person")

Data query:

Select case When r.verificationStatus = 'PENDING' and (date_diff_str(clock_local() , r.lastPendingDocumentDate, 'day') >= 4 OR date_diff_str(embarkDate, r.lastPendingDocumentDate, 'day') <= 5) Then 'OVERDUE' ELSE r.verificationStatus end as status, r.startDate,r.isInternational,r.isOcrDirty,r.endDate,r.firstName, r.middleName, r.lastName, r.personId, r.securityPhotoMediaItemId,r.cardNumber,r.isReviewLater, r.verifiedBy From reporting r where r.type = 'Person' AND meta(r).id NOT LIKE "_sync%" AND (ANY x IN [LOWER(r.cardNumber),LOWER(r.firstName), LOWER(r.lastName)] SATISFIES x like LOWER("b%") END) order by meta(r).id limit 20 offset 0;

Time taken on both machines:

On Machine-1 greater than 10 s (with approx. 0.2 M records in the bucket)
On Machine-2 within 1.5 s (with 0.3 M documents in the bucket)

Explain Plan:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan2",
              "covers": [
                "cover ((distinct (array `n` for `n` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] end)))",
                "cover ((`r`.`verificationStatus`))",
                "cover ((`r`.`lastPendingDocumentDate`))",
                "cover ((`r`.`startDate`))",
                "cover ((`r`.`isInternational`))",
                "cover ((`r`.`isOcrDirty`))",
                "cover ((`r`.`endDate`))",
                "cover ((`r`.`firstName`))",
                "cover ((`r`.`middleName`))",
                "cover ((`r`.`lastName`))",
                "cover ((`r`.`personId`))",
                "cover ((`r`.`securityPhotoMediaItemId`))",
                "cover ((`r`.`cardNumber`))",
                "cover ((`r`.`isReviewLater`))",
                "cover ((`r`.`verifiedBy`))",
                "cover ((meta(`r`).`id`))"
              ],
              "filter_covers": {
                "cover ((`r`.`type`))": "Person",
                "cover (any `x` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] satisfies ((\"b\" <= `x`) and (`x` < \"c\")) end)": true,
                "cover (any `x` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] satisfies (`x` like lower(\"b%\")) end)": true
              },
              "index": "IX_reporting_Person",
              "index_id": "2b071909296788d4",
              "index_projection": {
                "entry_keys": [
                  1,
                  3,
                  4,
                  5,
                  6,
                  8,
                  9,
                  10,
                  11,
                  12,
                  13,
                  14,
                  15,
                  16,
                  17
                ],
                "primary_key": true
              },
              "keyspace": "reporting",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "\"c\"",
                      "inclusion": 1,
                      "low": "\"b\""
                    }
                  ]
                }
              ],
              "using": "gsi"
            }
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Filter",
                  "condition": "(((cover ((`r`.`type`)) = \"Person\") and (not (cover ((meta(`r`).`id`)) like \"_sync%\"))) and cover (any `x` in [lower((`r`.`cardNumber`)), lower((`r`.`firstName`)), lower((`r`.`lastName`))] satisfies (`x` like lower(\"b%\")) end))"
                },
                {
                  "#operator": "InitialProject",
                  "result_terms": [
                    {
                      "as": "status",
                      "expr": "case when ((cover ((`r`.`verificationStatus`)) = \"PENDING\") and ((4 <= date_diff_str(clock_str(), cover ((`r`.`lastPendingDocumentDate`)), \"day\")) or (date_diff_str(cover ((`r`.`startDate`)), cover ((`r`.`lastPendingDocumentDate`)), \"day\") <= 5))) then \"OVERDUE\" else cover ((`r`.`verificationStatus`)) end"
                    },
                    {
                      "expr": "cover ((`r`.`startDate`))"
                    },
                    {
                      "expr": "cover ((`r`.`isInternational`))"
                    },
                    {
                      "expr": "cover ((`r`.`isOcrDirty`))"
                    },
                    {
                      "expr": "cover ((`r`.`endDate`))"
                    },
                    {
                      "expr": "cover ((`r`.`firstName`))"
                    },
                    {
                      "expr": "cover ((`r`.`middleName`))"
                    },
                    {
                      "expr": "cover ((`r`.`lastName`))"
                    },
                    {
                      "expr": "cover ((`r`.`personId`))"
                    },
                    {
                      "expr": "cover ((`r`.`securityPhotoMediaItemId`))"
                    },                    
                    {
                      "expr": "cover ((`r`.`cardNumber`))"
                    },
                    {
                      "expr": "cover ((`r`.`isReviewLater`))"
                    },
                    {
                      "expr": "cover ((`r`.`verifiedBy`))"
                    }
                  ]
                }
              ]
            }
          }
        ]
      },
      {
        "#operator": "Order",
        "limit": "20",
        "sort_terms": [
          {
            "expr": "cover ((meta(`r`).`id`))"
          }
        ]
      },
      {
        "#operator": "Limit",
        "expr": "20"
      },
      {
        "#operator": "FinalProject"
      }
    ]
  },
  "text": "Select case When r.verificationStatus = 'PENDING' and (date_diff_str(clock_local() , r.lastPendingDocumentDate, 'day') >= 4 OR date_diff_str(startDate , r.lastPendingDocumentDate, 'day') <= 5) Then 'OVERDUE' ELSE r.verificationStatus end as status,\nr.startDate,r.isInternational,r.isOcrDirty,r.endDate,r.firstName, r.middleName, r.lastName, r.personId, r.securityPhotoMediaItemId, r.cardNumber,r.isReviewLater, r.verifiedBy \nFrom reporting r\nwhere r.type = 'Person'\nAND meta(r).id NOT LIKE \"_sync%\" \nAND (ANY x IN [LOWER(r.cardNumber),LOWER(r.firstName), LOWER(r.lastName)] SATISFIES x like LOWER(\"b%\") END)\norder by meta(r).id limit 20 offset 0;"
}

Is DP4 really that faster than Non-Analytics version of couchbase Server? or it might be something else that I am missing. Please let me know.

You are using aggregate query and Analytical is designed for that it does parallel. cc @till, @keshav_m

Is It supposed to improve N1QL performance as well? I was thinking SQL++ is faster and N1QL will give same performance for both versions of couchbase.

And if these two are doing things differently, then shouldn’t it be reflected in the explain plan as well? The exact same explain plan for both confused me a bit.

If the two installations have exact same plan (and exact version of the Couchbase server), the only difference is hardware, type of the index, etc. Check for which type of index you have on both.

@keshav_m , the machine 1 has Couchbase 5.0.1-5003-enterprise and Machine 2 has couchbase 5.0.0-837-enterprise (Analytics DP4). Both have Standard Global Secondary Indexes and hardware configuration for both machines is same.

5.0.0-837-enterprise (Analytics DP4)
Is this Analytical service.

I am using N1QL query service on this one as well. Not Analytics query service

Enable profile (https://blog.couchbase.com/optimize-n1ql-performance-using-request-profiling/) and check which operator taking more time on both the machines

Machine 1:

   "phaseCounts": {
        "indexScan": 38492,
        "sort": 36853
    },
    "phaseOperators": {
        "authorize": 1,
        "indexScan": 1,
        "sort": 2
    },
    "phaseTimes": {
        "authorize": "37.051862ms",
        "indexScan": "8.407853868s",
        "instantiate": "69.229µs",
        "parse": "3.21491ms",
        "plan": "50.447944ms",
        "run": "9.795408322s",
        "sort": "233.726781ms"
    }

Machine 2:

  "phaseCounts": {
        "indexScan": 56191,
        "sort": 53728
    },
    "phaseOperators": {
        "authorize": 1,
        "indexScan": 1,
        "sort": 2
    },
   "phaseTimes": {
            "authorize": "631.49µs",
            "indexScan": "1.269954302s",
            "instantiate": "62.986µs",
            "parse": "3.239395ms",
            "plan": "1.373459ms",
            "run": "1.397644393s",
            "sort": "94.06877ms"
        }

My bad. Machine 1 (5.0.1) is using Legacy Global Secondary Index. (Didn’t even know this type of index existed!). How is this different than Standard Global Secondary Indexes?

Legacy Global Secondary Index is old forestdb, due to upgrade from old server
Standard Global Secondary Indexes is plasma

So I guess this shouldn’t be causing the issue as well.

That is the reason of performance differance, forestdb is very slow, plasma is much faster https://blog.couchbase.com/plasma-storage-engine-gsi/ cc @deepkaran.salooja .

https://blog.couchbase.com/plasma-index-performance/

This is unrelated. Did you tried this index.
Also add escape charter for _ in like   meta().id NOT LIKE "\\_sync%" 

 

CREATE INDEX IX_reporting_Person ON reporting((distinct (array n for n in [lower(reservationNumber), lower(firstName), lower(lastName)] end)),verificationStatus,lastPendingDocumentDate,startDate,isInternational,isOcrDirty,endDate,firstName,middleName,lastName,personId,securityPhotoMediaItemId,cardNumber,isReviewLater,verifiedBy) WHERE (type = "Person") and meta().id NOT LIKE "_sync%"

the escape character should be in main query or in index query?