ORDER_BY ASC on index is much slower than ORDER_BY DESC

I have a database with >10K records and use couchbase-lite-core opensource to manage.
The database has one of the indices with information: index name: “rec_duration”, on field: “duration”

When ORDER_BY duration (either DESC or ASC), here is the c4query_explain() and query performance result:

SELECT fl_result(fl_root(_doc.body)) FROM kv_default AS _doc WHERE (_doc.key LIKE 'rec:%') AND (_doc.flags & 1 = 0) ORDER BY fl_value(_doc.body, 'duration') DESC LIMIT MAX(0, 8) OFFSET MAX(0, 0)
10|0|0| SCAN TABLE kv_default AS _doc USING INDEX rec_duration
[\"SELECT\",{\"WHAT\":[[\".\"]],\"WHERE\":[\"LIKE\", [\".\",\"_id\"], \"rec:%\"],\"OFFSET\":0,\"LIMIT\":8,\"ORDER_BY\":[[\"DESC\",[\".duration\"]]]}]
{QueryEnum#4} Created on {Query#3} with 8 rows (3922 bytes) in 1.812ms

SELECT fl_result(fl_root(_doc.body)) FROM kv_default AS _doc WHERE (_doc.key LIKE 'rec:%') AND (_doc.flags & 1 = 0) ORDER BY fl_value(_doc.body, 'duration') LIMIT MAX(0, 8) OFFSET MAX(0, 0)
10|0|0| SCAN TABLE kv_default AS _doc USING INDEX rec_duration
[\"SELECT\",{\"WHAT\":[[\".\"]],\"WHERE\":[\"LIKE\", [\".\",\"_id\"], \"rec:%\"],\"OFFSET\":0,\"LIMIT\":8,\"ORDER_BY\":[[\".duration\"]]}]
{QueryEnum#6} Created on {Query#5} with 8 rows (3724 bytes) in 3540.810ms

You can see that DESC order_by takes only 1.812ms while default (ASC) order_by takes 3540.810ms.

I also read about sqlite indexing, they can search from both direction. I’m not sure about couchbase lite.
I’ve read same topics in the forum, but it seems they have problems with DESC order_by (opposite to my case) Is there any bug in my application? :thinking:

Anw, I haven’t found any solution yet. Can anyone here explain? Or is there something wrong with my query statement?
Thanks a lot!

This is quite bizarre. We don’t actually implement that functionality, and as you say SQLite indexing can search from both directions. If that is true, we are actually just leaving it up to SQLite so I don’t see why it would be a problem. @jens Do you have any ideas?

(I took the liberty of editing the logs in your post to make them more readable.)

So, the word SCAN in the query explanations show that neither is efficiently using indexes. I haven’t seen SCAN...USING INDEX before, but from reading the SQLite docs I think it indicates that the index is being used to sort the output, but not to find the rows.

So in both cases the query has to scan through every document in the database, collecting ones where the key matches the test. It’s using the rec_duration index to do the scan, so it finds docs in the right order.

I think the reason the descending query is faster is because it very quickly finds 8 matching documents and can stop. The ascending query is first going to scan through all the docs in the database with a null value for the duration property, and these probably don’t have IDs starting with rec:. Only after it gets through all of those will it start collecting any matching docs.

I’m not sure why the query optimizer didn’t decide to use the index on key (the table’s primary key). It may be that it doesn’t consider LIKE tests indexable; try replacing it with the equivalent of (id >= 'rec:' and id < 'rec;') (note the semicolon in the second string; this is the ASCII character after colon.)

OK, this is interesting and we may need to fix this in CBL. The SQLite query optimizer docs confirm that it does optimize LIKE expressions, but there are some constraints on when it can do so. I think a query like this one violates the constraints, particularly:

  • For the LIKE operator, if case_sensitive_like mode is enabled then the column must indexed using BINARY collating sequence, or if case_sensitive_like mode is disabled then the column must indexed using built-in NOCASE collating sequence.

Since we enable case_sensitive_like, the above implies that we need to update the table schema to change key TEXT PRIMARY KEY to key TEXT PRIMARY KEY COLLATE BINARY, to enable this sort of query to be indexed. (Unfortunately I don’t think it’s possible to update an existing table this way, so only new databases would get this optimization.)

I’ve filed a Jira issue on this.

I was wrong above: BINARY is the default collation, so the key column already has it. I’ve tried out some queries in the cblite tool and they do use the index on key. In your query, somehow SQLite is getting distracted by the fact there’s an index on the property you’re ordering by, and falsely decides this is a better index to use.

If you don’t need the index on duration for other queries, I’d remove it, since it’s hurting this query.

Thanks @jens for your great investigation.

So, the word SCAN in the query explanations show that neither is efficiently using indexes. I haven’t seen SCAN...USING INDEX before, but from reading the SQLite docs I think it indicates that the index is being used to sort the output, but not to find the rows.

I saw some sqlite explanation of SCAN... USING INDEX here: EXPLAIN QUERY PLAN, item 1.2. Temporary Sorting B-Trees. So I think SCAN ... USING INDEX is expected when order_by on an index.

If you don’t need the index on duration for other queries, I’d remove it, since it’s hurting this query.

I’ve tried to do so, then explain plan has USE TEMP B-TREE FOR ORDER BY. The performance is between ORDER_BY DESC and ORDER_BY ASC with index on duration.

In your query, somehow SQLite is getting distracted by the fact there’s an index on the property you’re ordering by, and falsely decides this is a better index to use.

Ok, so we cannot do anything in cblite :sob:

Thank you again for your support!

@jens @borrrden I think I know the issue comes from after readingthis ticket. My database contains about 90% records that don’t have duration field. I create another database without these 90% records, but only records with duration fields, the performance between order_by ASC and DESC are now very little different.

So as I can understand, order_by ASC may have to go through a lot of null values before it can reach the meaningful values of duration.

And if so, the jira ticket above can resolve the issue when it’s done?

Same thoughts?

You could specify ...AND duration IS NOT MISSING, so when the query uses the duration index it will quickly skip past all the entries with a null/missing value.

(This highlights that we should be creating partial indexes, which only contain non-MISSING values.)

You could specify ...AND duration IS NOT MISSING , so when the query uses the duration index it will quickly skip past all the entries with a null/missing value.

I tried this but it seems no much difference:

// ORDER_BY DESC, not use `...AND duration IS NOT MISSING`
"SELECT fl_result(fl_root(_doc.body)) FROM kv_default AS _doc 
WHERE (_doc.key LIKE 'rec:%') AND (_doc.flags & 1 = 0) 
ORDER BY fl_value(_doc.body, 'duration') DESC 
LIMIT MAX(0, 8) OFFSET MAX(0, 0)\n\n10|0|0| 
SCAN TABLE kv_default AS _doc USING INDEX rec_duration\n\n
[\"SELECT\",{\"WHAT\":[[\".\"]],\"WHERE\":[\"LIKE\", [\".\",\"_id\"], \"rec:%\"],\"OFFSET\":0,\"LIMIT\":8,\"ORDER_BY\":[[\"DESC\",[\".duration\"]]]}]\n"
[CBL] "{QueryEnum#6} Created on {Query#5} with 8 rows (3922 bytes) in 0.534ms"

// ORDER_BY DESC, use `...AND duration IS NOT MISSING`
"SELECT fl_result(fl_root(_doc.body)) FROM kv_default AS _doc
WHERE (_doc.key LIKE 'rec:%' AND fl_value(_doc.body, 'duration') IS NOT NULL) AND (_doc.flags & 1 = 0) 
ORDER BY fl_value(_doc.body, 'duration') DESC 
LIMIT MAX(0, 8) OFFSET MAX(0, 0)\n\n10|0|0|
SCAN TABLE kv_default AS _doc USING INDEX rec_duration\n\n
[\"SELECT\",{\"WHAT\":[[\".\"]],\"WHERE\":[\"AND\",[\"LIKE\", [\".\",\"_id\"], \"rec:%\"],[\"IS NOT\", [\".duration\"],[\"MISSING\"]]],\"OFFSET\":0,\"LIMIT\":8,\"ORDER_BY\":[[\"DESC\",[\".duration\"]]]}]\n"
[CBL] "{QueryEnum#8} Created on {Query#7} with 8 rows (3922 bytes) in 0.583ms"

// ORDER_BY ASC, not use `...AND duration IS NOT MISSING`
"SELECT fl_result(fl_root(_doc.body)) FROM kv_default AS _doc 
WHERE (_doc.key LIKE 'rec:%') AND (_doc.flags & 1 = 0) 
ORDER BY fl_value(_doc.body, 'duration') 
LIMIT MAX(0, 8) OFFSET MAX(0, 0)\n\n10|0|0|
SCAN TABLE kv_default AS _doc USING INDEX rec_duration\n\n
[\"SELECT\",{\"WHAT\":[[\".\"]],\"WHERE\":[\"LIKE\", [\".\",\"_id\"], \"rec:%\"],\"OFFSET\":0,\"LIMIT\":8,\"ORDER_BY\":[[\".duration\"]]}]\n"
[CBL] "{QueryEnum#10} Created on {Query#9} with 8 rows (3724 bytes) in 968.152ms"

// ORDER_BY ASC, use `...AND duration IS NOT MISSING`
"SELECT fl_result(fl_root(_doc.body)) FROM kv_default AS _doc
WHERE (_doc.key LIKE 'rec:%' AND fl_value(_doc.body, 'duration') IS NOT NULL) AND (_doc.flags & 1 = 0)
ORDER BY fl_value(_doc.body, 'duration') 
LIMIT MAX(0, 8) OFFSET MAX(0, 0)\n\n10|0|0| 
SCAN TABLE kv_default AS _doc USING INDEX rec_duration\n\n
[\"SELECT\",{\"WHAT\":[[\".\"]],\"WHERE\":[\"AND\",[\"LIKE\", [\".\",\"_id\"], \"rec:%\"],[\"IS NOT\", [\".duration\"],[\"MISSING\"]]],\"OFFSET\":0,\"LIMIT\":8,\"ORDER_BY\":[[\".duration\"]]}]\n"
[CBL] "{QueryEnum#12} Created on {Query#11} with 8 rows (3724 bytes) in 938.658ms"

Can you take a look again in the query explain plan @jens ?

The only workaround I can suggest, then, is not having an index on duration.

Got it. Thanks a lot, @jens!