Access to index without SQL++ query

I believe this refers to this query.

SELECT RAW META().id FROM default AS d WHERE a BETWEEN 1 AND 2 AND b BETWEEN 1 AND 1

compared with this query

SELECT RAW META().id FROM default AS d WHERE [a,b] >= [1,1] AND [a,b] <= [2,1]

If we apply those queries to a=1, b=3, the first query does not match, but the second one does. But we can write the equivalent of the second query without using arrays. I’m going to change the upper bound of the window from [2,1] to [3,1] to make this a little more obvious.

SELECT * FROM default AS d WHERE (a > 1 OR (a=1 AND b >= 1)) AND (a < 3 OR (a=3 AND b <= 1))
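For what it’s worth, the equivalence between the array comparison and the expanded boolean form can be checked outside the database. Python tuples compare lexicographically, the same way SQL++ arrays of scalars do, so a small sketch:

```python
# Verify that the expanded boolean form matches lexicographic
# tuple comparison over the window [1,1] .. [3,1].
def array_form(a, b):
    return (1, 1) <= (a, b) <= (3, 1)

def boolean_form(a, b):
    return (a > 1 or (a == 1 and b >= 1)) and (a < 3 or (a == 3 and b <= 1))

# Check every combination over a small grid of values.
for a in range(0, 5):
    for b in range(0, 5):
        assert array_form(a, b) == boolean_form(a, b), (a, b)
print("forms agree on the whole grid")
```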

If I’m not mistaken, that gives the same result. If we take that query and substitute the literals with parameters (as your application will):

SELECT * FROM default AS d WHERE (a > $1 OR (a=$1 AND b >= $2)) AND (a < $3 OR (a=$3 AND b <= $4))

Then use Index Advisor to create the indexes, generate the query plan, and see if it is to your liking. So you have two ways to get the same results.

I’ve attached the query plan from running the query on the sample data that vsr1 provided, using the window [1,1] - [2,1].
plans.zip (1.7 KB)

I’m still curious as to how “descending keys” are used.


Thanks, I mean:

CREATE INDEX ix1 ON default(a, b desc, c);

From recollection (I’ll reconfirm), it’s just a matter of inverting the comparison for the second value in the key (b) inside the boolean expression you gave.

it’s just a matter of inverting the comparison for the second value in the key (b) inside the boolean expression you gave.

Wouldn’t that be the same as leaving it in ascending order and leaving the comparison as-is?
I can see how it would help if the query specified an upper bound, descending order, and a LIMIT; but if it is a range from a lower bound to an upper bound, I don’t see how ascending or descending order matters. The bounds can be swapped and the index traversed in the opposite direction, and the result is the same. Maybe I’m missing something.
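That intuition can be checked with a toy model: for a simple range predicate, an ascending index scanned forward and a descending index scanned backward visit exactly the same keys. A small Python sketch:

```python
# For a simple range, an ascending index scanned forward and a
# descending index scanned backward yield the same keys.
asc_index = sorted([5, 2, 9, 1, 6, 3])
desc_index = sorted(asc_index, reverse=True)

lo, hi = 2, 6
forward = [k for k in asc_index if lo <= k <= hi]
backward = [k for k in reversed(desc_index) if lo <= k <= hi]
assert forward == backward
print(forward)
```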

Rewriting the queries to not rely on composite array keys solves the issue of the composite keys not being handled correctly. That’s how we got to this query:

SELECT * FROM default AS d WHERE (a > $1 OR (a=$1 AND b >= $2)) AND (a < $3 OR (a=$3 AND b <= $4))

The other option is to have a composite key for one sort-direction, and another composite key for another sort-direction.

It’s not the same because the lexicographic ordering is different, and it matters to the application.
That’s why desc is allowed as a keyword at index creation. Say “a” is the last name (ascending), “b” is the age (descending), and “c” is the first name (ascending). It makes no sense, I know, but it’s just for the sake of the example. We would have in the index (and the result of the select too):

“Doe”, 32, “Jane”
“Doe”, 28, “Emma”
“Doe”, 28, “Laura”
“Smith”, 90, “Bob”
“Smith”, 90, “John”
“Smith”, 18, “Albert”
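Assuming numeric ages, that mixed asc/desc ordering can be reproduced with a composite sort key that negates the descending component; a small Python sketch:

```python
# Reproduce an index ordered as (last ASC, age DESC, first ASC)
# by negating the numeric descending component in the sort key.
rows = [
    ("Smith", 90, "John"),
    ("Doe", 28, "Laura"),
    ("Smith", 18, "Albert"),
    ("Doe", 32, "Jane"),
    ("Doe", 28, "Emma"),
    ("Smith", 90, "Bob"),
]

ordered = sorted(rows, key=lambda r: (r[0], -r[1], r[2]))
for last, age, first in ordered:
    print(last, age, first)
```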

And it seems (I’m being cautious here) that while the select works fine (and allows keyset pagination), it does a full index scan. It does so (that’s what we see) whenever the index is created using something of the form (a, b, c) with no descending key at all, or (a, b desc, c); in the latter case it’s the only form allowed, I guess. The select performs a true slice of the index only when a JSON array like ([a, b, c]) is specified. That form is possible only if all the keys are ascending (or all descending, I guess).

Thanks for your help and your patience, Michael, btw, much appreciated!

It doesn’t do a full index scan (NOTE: Couchbase doesn’t support REVERSE scans). It all depends on the predicates. In the case above, if a >= “Smith” AND b > 50 AND b < 91, the index will never scan from the start. It starts at “Smith”, 91, “Bob” and stops at “Smith”, 18, “Albert”. That is the way a btree scan works.
To put it plainly, imagine a familiar real-world example: an index is like a dictionary, where each index key is a letter in the word.
Reading from disk is done in pages, and scanning for the word within a page is the index scan.
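The start-and-stop behaviour described above can be modelled with a binary search over sorted keys (a toy stand-in for the btree, using Python’s bisect module):

```python
import bisect

# Model a btree range scan over sorted keys: locate the start
# position with binary search, then walk forward until past the
# end of the qualifying range.
keys = ["Doe", "Doe", "Doe", "Smith", "Smith", "Smith"]
lo = bisect.bisect_left(keys, "Smith")  # first qualifying entry
scan = []
for k in keys[lo:]:
    if k > "Smith":
        break  # past the range; stop, never a full scan
    scan.append(k)
print(len(scan))  # only the "Smith" entries are visited
```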

It does so [a full index scan] (that’s what we see) whenever the index is created using something of the form … (a, b desc, c)

Can you show the query, indexes used by the query (from index advisor) and the query plan (after executing the query)? We’re not expecting a full index scan here - maybe something else is going on.

Yes, as explained before, that’s not the select we want. You are changing it, and it does not return the same documents.

We need index creation of the form:

CREATE INDEX ix1 ON default(a, b);

So that we can generalize it to the desc case. And this is a select that would return the rows we expect (a full slice of the index):

SELECT RAW META().id FROM default AS d WHERE [a,b] >= [1,1] AND [a,b] <= [2,1]

The select:

SELECT RAW META().id FROM default AS d WHERE (a > 1 OR (a=1 AND b >= 1)) AND (a < 3 OR (a=3 AND b <= 1))

does the same, so it would be fine as well. It is the only form possible as soon as a desc appears in the index creation (with a few changes in the comparisons), so we would expect it to behave properly too.

Do both of these selects perform a retrieval from the index using [1, 1] as the low bound? We had the (possibly mistaken) feeling, reading the execution plan, that the first does a full index scan, or makes very inefficient use of [1, 1].

I can insert the execution plan later.

As I recall from yesterday (see my earlier post), Index Advisor gives more than one index. After creating them, the query plan has four index scans, each with lower and upper bounds (i.e., not full index scans), that are combined into a UnionScan (see the plans.zip that I attached yesterday).

We had the (possibly mistaken) feeling, reading the execution plan, that the first does a full index scan, or makes very inefficient use of [1, 1].

That’s not what plans.zip shows. I suspect because you do not have all the indexes advised by index advisor.

Here is what I read in your uploaded plan.zip file:

            "range": [
              {
                "high": "2",
                "inclusion": 0,
                "index_key": "`a`",
                "low": "1"
              }
            ]

That does not seem right? The condition is [a,b] >= [1,1] AND [a,b] <= [2,1].
The low should be [1,1]. If there are ten million documents with a = 1, won’t it be super inefficient?

I’m on my phone right now, so I can’t look at the query plan. The query plan is for the query that does not use arrays. If you want the query plan for the query that does use arrays, I gave instructions on how to do that. Or I can do it when I’m near a computer.

Yes, that’s what I meant, sorry. When all keys are ascending, the low and high are properly set. Now when we introduce descending, we have to use the complex boolean condition, and then it does not use the index efficiently. That’s why I was concluding that as soon as there is a mix of ascending and descending, the select becomes (potentially) inefficient, and keyset pagination does not work nicely.

As mentioned, ARRAYs are complex.
With non-arrays (scalars), once the query produces the spans, the indexer knows how to scan the index optimally.
In the case of a desc key, it uses the high value as the starting point and stops at the low value. In the case of a composite key, after reading from disk it knows how to apply the individual keys and eliminate false positives.

SELECT RAW META().id FROM default AS d WHERE [a,b] >= [1,1] AND [a,b] <= [2,1]

equivalent

SELECT RAW META().id FROM default AS d WHERE (a > 1 AND a < 2) OR (a=1 AND b >= 1) OR (a=2 AND (b <= 1 OR b IS NOT VALUED))

OR

WHERE (a >= 1 AND a <= 2) AND (a != 1 OR b >= 1) AND (a != 2 OR b <= 1 OR b IS NOT VALUED)
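The b IS NOT VALUED clause is needed because, if I read the collation rules right, a shorter array sorts before a longer array that shares its prefix: a document with a=2 and no b value produces the key [2], which still falls inside the window [1,1] .. [2,1]. Python tuples behave the same way:

```python
# A shorter tuple sorts before a longer tuple sharing the same
# prefix, which is why a document with a=2 and a missing b still
# falls inside the window [1,1] .. [2,1].
assert (2,) < (2, 1)               # the bare prefix sorts first
assert (1, 1) <= (2,) <= (2, 1)    # so key [2] is inside the window
print("prefix ordering confirmed")
```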

Yes, and from what I read in the execution plan, the select with the complex boolean condition only uses the “a” value to find its way in the btree, and then does a sequential scan. If there are millions of documents with the same “a” value, response time is very bad.

In plans.zip,

"condition": "(((1 < (`d`.`a`)) or (((`d`.`a`) = 1) and (1 <= (`d`.`b`)))) and (((`d`.`a`) < 2) or (((`d`.`a`) = 2) and ((`d`.`b`) <= 1))))"

It looks like the query would have been a query on the range [1,1] - [2,1], which is…

... WHERE (a > 1 OR (a=1 AND b >= 1)) AND (a < 2 OR (a=2 AND b <= 1)) ORDER BY META().id

And your question is about the range scan :

             ...
             "range": [
              {
                "high": "2",
                "inclusion": 0,   // 0 -> exclusive
                "index_key": "`a`",
                "low": "1"
              }
            ]

The low should be [1,1]. If there are ten million documents with a = 1, won’t it be super inefficient?

The part you are missing is that inclusion: 0 means exclusive. This range scan is for the middle part (where ‘a’ is strictly between the bottom and the top), which means b does not have to be considered. (Unfortunately, because we chose the range 1 - 2 for a in this example, that middle part will always be empty. If we had chosen 1 - 5, the range would be low=1, high=5, exclusive, and would include all the 2,* through 4,* documents, which is correct and minimal.) We still need the 1,* documents where b >= 1 and the 2,* documents where b <= 1.

Let’s look at the ranges in the other IndexScans

The first range is for a=1 (i.e. 1,*) and the second range is for b >= 1. This will get the documents on the bottom end where a matches our bottom (a=1), which means the predicate then depends only on b.

             "index": "adv_a_b",
              ...
              "spans": [
              {
                "exact": true,
                "range": [
                  {
                    "high": "1",
                    "inclusion": 3, // exact match
                    "index_key": "`a`",
                    "low": "1"
                  },
                  {
                    "inclusion": 1, // inclusive
                    "index_key": "`b`",
                    "low": "1"
                  }
                ]
              }
            ]

The first range is for a=2 (i.e. 2,*) and the second range is for b <= 1. This will get the documents on the top end where a matches our top (a=2), which means the predicate then depends only on b.

             "index": "adv_a_b",
               ...
              "spans": [
              {   
                "exact": true,
                "range": [
                  {   
                    "high": "2",
                    "inclusion": 3, // exact match
                    "index_key": "`a`",
                    "low": "2" 
                  },  
                  {   
                    "high": "1",
                    "inclusion": 2,  // inclusive
                    "index_key": "`b`",
                    "low": "null"
                  }   
                ]   
              }   
            ]

I cannot speak to the efficiency of these last two IndexScans. They might suffer from what you mentioned, depending on the data.
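To tie the three spans together, here is a small Python sketch (toy predicates standing in for the plan’s ranges) checking that the middle exclusive span plus the two edge spans reproduce the window [1,1] to [2,1]:

```python
# Model the three IndexScan spans that together cover the
# window [1,1] .. [2,1] on an index over (a, b).
def middle(a, b):       # 1 < a < 2, exclusive; b unconstrained
    return 1 < a < 2

def bottom_edge(a, b):  # a = 1, b >= 1
    return a == 1 and b >= 1

def top_edge(a, b):     # a = 2, b <= 1
    return a == 2 and b <= 1

def window(a, b):       # direct lexicographic comparison
    return (1, 1) <= (a, b) <= (2, 1)

for a in [0, 1, 1.5, 2, 3]:
    for b in [0, 1, 2]:
        union = middle(a, b) or bottom_edge(a, b) or top_edge(a, b)
        assert union == window(a, b), (a, b)
print("span union matches the window")
```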

The low should be [1,1]. If there are ten million documents with a = 1, won’t it be super inefficient?

Since you have the data and the queries, you can test to see which approach works better.

Also, regarding needing the index to be ascending or descending because the application needs the results in order: only ORDER BY guarantees the order. Having the index ordered does not guarantee that the results are ordered (and given that the query plan in plans.zip is a union of three IndexScans, I would not expect them to be in order). Using the array index, I would expect them to be in order, but that is not guaranteed.

That was my next (and last, I promise) question. We’re using ORDER BY, but we were wondering whether it could make things worse. Some sort algorithms are very inefficient when items are already sorted.

You answered it before I asked!

Using arrays seems to have advantages: only one index is needed, the queries are simpler, the query plans are obviously efficient, and the results are already sorted (not “guaranteed”, but it looks to me like they would always be in order). It looks like a winner.
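The keyset-pagination pattern this thread keeps coming back to can also be modelled in plain Python (no SDK calls, just the “resume strictly after the last seen key” idea over lexicographically ordered keys):

```python
# Model keyset pagination: each page resumes strictly after the
# last (a, b) key seen, relying on lexicographic ordering.
index = sorted((a, b) for a in range(1, 4) for b in range(1, 4))

def page_after(last_key, size):
    """Return the next `size` keys strictly greater than last_key."""
    return [k for k in index if k > last_key][:size]

first = page_after((0, 0), 4)        # first page from the start
second = page_after(first[-1], 4)    # resume after the last key seen
print(first, second)
```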


I’ll make a few trials once I can get my hands on a Couchbase console. I’m on the road right now. In the meantime, big thanks!

Here are the results of my experiments. An uneducated, cursory look at the execution plans tells me that we’re fine, but we would be grateful to have confirmation.

We have created two simple indexes (without JSON array):

CREATE INDEX abc ON test(a, b, c)
CREATE INDEX abdescc ON test(a, b desc, c)

Here are two execution plans for the first one:

abc.par.lex.txt.zip (1.3 KB)
abc.par.pre.txt.zip (1.1 KB)

And one for the second one:

abdescc.par.lex.txt.zip (1.4 KB)

Instead of using JSON array comparisons we’ve used the versatile lexicographic boolean expression. All selects take a slice of the respective index.

The execution plans show that all parameters are used in the ranges, so we assume that even if there are lots of documents with the same ‘a’ value, the selects will be very fast. Can you confirm?

As usual, thanks a lot!

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.