Numeric IDs sort as strings in collections

jeff.lockhart · December 30, 2023, 1:54am

Previously, in CBL 3.0, as well as in 3.1 when querying the default collection, documents are sorted by document ID numerically in query results. But when querying just the document IDs in a non-default collection in 3.1, the results are sorted by ID using a non-numeric string sort. For example:

val db = Database("test")
val col = db.createCollection("collection")
for (i in 1..20) {
    db.defaultCollection?.save(MutableDocument(i.toString()))
    col.save(MutableDocument(i.toString()))
}
db.createQuery("SELECT META().id FROM ${db.defaultCollection?.fullName}")
    .execute()
    .use { rs -> println("_default: " + rs.map { it.getString("id") }) }
db.createQuery("SELECT META().id FROM ${col.fullName}")
    .execute()
    .use { rs -> println("collection: " + rs.map { it.getString("id") }) }

will output:

_default: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
collection: [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 3, 4, 5, 6, 7, 8, 9]

Interestingly, if you add something else to the query SELECT, the sort works correctly numerically. E.g. changing SELECT META().id to SELECT META().id, * outputs:

_default: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
collection: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

The same occurs with document IDs that contain numbers, e.g. “doc-1”, “doc-2”, “doc-3”, etc.

UPDATE:
After further testing, I realized it’s not that it’s doing a numeric sort by document ID, but just returning the documents in their insertion order, while non-default collections when selecting only META().id are performing an additional string sort by document ID for some reason.

mreiche · December 30, 2023, 3:29pm

I suspect that if you file an issue, the response will be that sort order is not guaranteed without an ORDER BY.

jeff.lockhart · December 30, 2023, 4:45pm

I would expect the default ORDER BY to be by document ID and for that sort to be consistent between collections and SELECT clauses, which it’s not. The numeric sort would be the expected sort algorithm.

mreiche · December 30, 2023, 6:51pm

There is no default order. If no order is specified, the results can be presented in any order. Even from one execution of the same query to another execution. It’s easier/faster to return unsorted results than sorted.

jeff.lockhart · December 30, 2023, 9:10pm

In practice, I’ve found the default sort to be consistent, at least with Couchbase Lite. Testing this further, it seems this is actually based on insertion order, rather than an actual numeric sort. But now non-default collections when selecting only META().id are applying a string sort to the results.

I found this bug after updating my Kotlin Multiplatform CBL library’s paging extension to the 3.1 API using collections. The tests have consistently passed with the insertion order sort, and now the behavior I described above is consistent, just different between the default and non-default collections and when selecting only META().id or not.

Another interesting find, if I give the query an ORDER BY clause that’s not valid (property doesn’t exist), then it won’t apply the document ID string sort for non-default collections. So this seems to be an extra sort operation performed on the results that I wouldn’t expect to happen.

What would be the proper ORDER BY clause to perform a numeric sort by document ID? There’s not a cast operator to convert the document ID string expression to a number. Is ORDER BY (META().id + 0) the only/best way to do this?

And I guess there wouldn’t be a way to do this with a numeric substring in the ID, e.g. “doc-1”, since there’s no substring function.

mreiche · December 30, 2023, 10:46pm

When no order-by is specified, the results can be returned in any order. There is no right or wrong order when no order-by is specified. The order is not required to be consistent from one version to another, nor even from one execution to another. I understand how tests could have relied on order and have worked like that, however, tests that rely on the order should use order-by.

jianminzhao · January 2, 2024, 6:43pm

Hi @jeff.lockhart , I am responsible for the query part of CBL. We use SQLite3 as the internal table storage. The document id is the primary key of the table, but if don’t use the “ORDER BY” clause, the SQLite engine is not guaranteed to generate the result in any order. Practically, the row-order, or insertion order is quite likely. Since it’s the key, “ORDER BY meta().id” won’t cost any. Document ID is typed TEXT in SQLite table, so, I would think that the ordering of “doc-2, doc-10” is accidental.

borrrden · January 2, 2024, 9:50pm

Not to be “that guy” but you could always take the old NuGet approach and pad your digits with zeros . That would result in a sane sort order. This is a problem that frequently emerges in many circles (file systems, package managers, etc).

jeff.lockhart · January 3, 2024, 7:13pm

This makes sense, thanks. I modified my tests to always perform an ORDER BY, with the + 0 numeric cast workaround.

Zero-padding is another potential workaround, which can handle the text prefix case as well, provided you provide enough zeros of course.

It does seem odd that non-default collections seem to perform an extra sort by META().id operation when querying only META().id though. And why doesn’t the default collection do the same thing?

pasin · January 15, 2024, 4:11pm

The best ways to check the difference b/w using default and non default is to compare the explain() result.

mreiche · January 15, 2024, 6:28pm

It does seem odd that non-default collections seem to perform an extra sort by META().id operation when querying only META().id though. And why doesn’t the default collection do the same thing?

Just speculating - but under the covers, maybe the scope/collection becomes the prefix of the id.

jeff.lockhart · January 17, 2024, 1:49am

compare the explain() result.

Good idea. I was just comparing the logged raw queries initially. But looking at the explain, it looks like the two differences are:

The default collection has WHERE (_default.flags & 1 = 0)
The non-default collection has USING COVERING INDEX sqlite_autoindex_kv_.scope-name.collection-name_1

So it looks like non-default collections have this covering index for document IDs that the query uses, which is why the results are sorted by those IDs from the covering index.

Interesting, I wonder if this means some queries without an index are more efficient in non-default collections vs the default collection.

jens · January 17, 2024, 5:57pm

I can explain (heh) both of those:

The default collection is still using a flag to mark whether a doc is deleted. In some recent release we moved to a more efficient schema where deleted documents are moved to a different table entirely; the intention was to upgrade the default collection to this, but it turned out to be too slow in some huge databases so we disabled that. New collections (including the default one in a new db) are created using the optimized deletion.
I can’t figure out what that index is – it’s created by SQLite to optimize queries, possibly based on a UNIQUE constraint somewhere or else based on runtime stats, but we don’t have any columns named ‘scope-name’ or ‘collection-name’. Unless – did you redact the names of the actual scope and collection? In that case the index name makes sense, but I don’t know what it does. You could find out by opening the db with the sqlite3 tool and running the “.schema” command.

In the end, the differences in query plans come down to the black magic done by SQLite’s query optimizer, which involves a bunch of statistics it keeps about the size of tables and indexes, performance of previous queries, etc. We don’t have any insight into this.

jeff.lockhart · January 17, 2024, 9:03pm

Thanks for those insights. Always interesting to learn more how things work under the hood of Couchbase Lite.

Yes, scope-name and collection-name are just the name of the non-default scope and collection. (It also seems to escape all of the capital letters in the names.)

The database was created by the latest version of CBL 3.1.3. So I’m assuming the default collection is using the optimized deletion mechanism, it just uses the flag in the query regardless?

Topic		Replies	Views
How to sort document id using ASC or DESC Couchbase Server	2	2239	June 17, 2019
Sorting and querying documents in couchbase-lite Couchbase Lite java	2	1808	February 27, 2018
Sort id stored as a string as a number in a full text search Full Text Search n1ql	5	672	December 7, 2023
Sorting by document values in couchbase and scala Couchbase Server	0	1864	March 27, 2014
Sorting Query for all type of value SQL++ query , n1ql , server	3	1770	April 25, 2019

Numeric IDs sort as strings in collections

Related topics