Numeric IDs sort as strings in collections

Previously, in CBL 3.0, as well as in 3.1 when querying the default collection, documents are sorted by document ID numerically in query results. But when querying just the document IDs in a non-default collection in 3.1, the results are sorted by ID using a non-numeric string sort. For example:

val db = Database("test")
val col = db.createCollection("collection")
for (i in 1..20) {
    db.defaultCollection?.save(MutableDocument(i.toString()))
    col.save(MutableDocument(i.toString()))
}
db.createQuery("SELECT META().id FROM ${db.defaultCollection?.fullName}")
    .execute()
    .use { rs -> println("_default: " + rs.map { it.getString("id") }) }
db.createQuery("SELECT META().id FROM ${col.fullName}")
    .execute()
    .use { rs -> println("collection: " + rs.map { it.getString("id") }) }

will output:

_default: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
collection: [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 3, 4, 5, 6, 7, 8, 9]

Interestingly, if you add something else to the query SELECT, the sort works correctly numerically. E.g. changing SELECT META().id to SELECT META().id, * outputs:

_default: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
collection: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

The same occurs with document IDs that contain numbers, e.g. “doc-1”, “doc-2”, “doc-3”, etc.

UPDATE:
After further testing, I realized it’s not that it’s doing a numeric sort by document ID, but just returning the documents in their insertion order, while non-default collections when selecting only META().id are performing an additional string sort by document ID for some reason.

I suspect that if you file an issue, the response will be that sort order is not guaranteed without an ORDER BY.

I would expect the default ORDER BY to be by document ID and for that sort to be consistent between collections and SELECT clauses, which it’s not. The numeric sort would be the expected sort algorithm.

There is no default order. If no order is specified, the results can be presented in any order. Even from one execution of the same query to another execution. It’s easier/faster to return unsorted results than sorted.

In practice, I’ve found the default sort to be consistent, at least with Couchbase Lite. Testing this further, it seems this is actually based on insertion order, rather than an actual numeric sort. But now non-default collections when selecting only META().id are applying a string sort to the results.

I found this bug after updating my Kotlin Multiplatform CBL library’s paging extension to the 3.1 API using collections. The tests have consistently passed with the insertion order sort, and now the behavior I described above is consistent, just different between the default and non-default collections and when selecting only META().id or not.

Another interesting find, if I give the query an ORDER BY clause that’s not valid (property doesn’t exist), then it won’t apply the document ID string sort for non-default collections. So this seems to be an extra sort operation performed on the results that I wouldn’t expect to happen.

What would be the proper ORDER BY clause to perform a numeric sort by document ID? There’s not a cast operator to convert the document ID string expression to a number. Is ORDER BY (META().id + 0) the only/best way to do this?

And I guess there wouldn’t be a way to do this with a numeric substring in the ID, e.g. “doc-1”, since there’s no substring function.

When no order-by is specified, the results can be returned in any order. There is no right or wrong order when no order-by is specified. The order is not required to be consistent from one version to another, nor even from one execution to another. I understand how tests could have relied on order and have worked like that, however, tests that rely on the order should use order-by.

Hi @jeff.lockhart , I am responsible for the query part of CBL. We use SQLite3 as the internal table storage. The document id is the primary key of the table, but if don’t use the “ORDER BY” clause, the SQLite engine is not guaranteed to generate the result in any order. Practically, the row-order, or insertion order is quite likely. Since it’s the key, “ORDER BY meta().id” won’t cost any. Document ID is typed TEXT in SQLite table, so, I would think that the ordering of “doc-2, doc-10” is accidental.

Not to be “that guy” but you could always take the old NuGet approach and pad your digits with zeros :slight_smile:. That would result in a sane sort order. This is a problem that frequently emerges in many circles (file systems, package managers, etc).

This makes sense, thanks. I modified my tests to always perform an ORDER BY, with the + 0 numeric cast workaround.

Zero-padding is another potential workaround, which can handle the text prefix case as well, provided you provide enough zeros of course.

It does seem odd that non-default collections seem to perform an extra sort by META().id operation when querying only META().id though. And why doesn’t the default collection do the same thing?

The best ways to check the difference b/w using default and non default is to compare the explain() result.

It does seem odd that non-default collections seem to perform an extra sort by META().id operation when querying only META().id though. And why doesn’t the default collection do the same thing?

Just speculating - but under the covers, maybe the scope/collection becomes the prefix of the id.

compare the explain() result.

Good idea. I was just comparing the logged raw queries initially. But looking at the explain, it looks like the two differences are:

  • The default collection has WHERE (_default.flags & 1 = 0)
  • The non-default collection has USING COVERING INDEX sqlite_autoindex_kv_.scope-name.collection-name_1

So it looks like non-default collections have this covering index for document IDs that the query uses, which is why the results are sorted by those IDs from the covering index.

Interesting, I wonder if this means some queries without an index are more efficient in non-default collections vs the default collection.

I can explain (heh) both of those:

  • The default collection is still using a flag to mark whether a doc is deleted. In some recent release we moved to a more efficient schema where deleted documents are moved to a different table entirely; the intention was to upgrade the default collection to this, but it turned out to be too slow in some huge databases so we disabled that. New collections (including the default one in a new db) are created using the optimized deletion.
  • I can’t figure out what that index is – it’s created by SQLite to optimize queries, possibly based on a UNIQUE constraint somewhere or else based on runtime stats, but we don’t have any columns named ‘scope-name’ or ‘collection-name’. Unless – did you redact the names of the actual scope and collection? In that case the index name makes sense, but I don’t know what it does. You could find out by opening the db with the sqlite3 tool and running the “.schema” command.

In the end, the differences in query plans come down to the black magic done by SQLite’s query optimizer, which involves a bunch of statistics it keeps about the size of tables and indexes, performance of previous queries, etc. We don’t have any insight into this.

3 Likes

Thanks for those insights. Always interesting to learn more how things work under the hood of Couchbase Lite.

Yes, scope-name and collection-name are just the name of the non-default scope and collection. (It also seems to escape all of the capital letters in the names.)

The database was created by the latest version of CBL 3.1.3. So I’m assuming the default collection is using the optimized deletion mechanism, it just uses the flag in the query regardless?