N1QL SELECT * VS SELECT <lots of fieldnames> performance

When I have a n1ql query that I need pretty much all the data form each document. Unfortunately, also not a cover query, and needs to be ordered, so it takes a long time to process.

Everything else being equal, I am noticing much better performance doing SELECT * than SELECT (almost all my field names, comma, seperated, list).

Back in the relational database days selecting just the fields needed was a best practice and typically resulted in better performance, but I wonder if there is any recommendation from Couchbase on this in the N1QL/GSI perspective?

1 Like

I’m sure others will comment on best practices with respect to N1QL and GSI. Another avenue you can explore is use of some of the other services directly from your application.

It depends a bit on your system architecture and goals with respect to where you want processing to occur and what kind of network capacity you have at application servers and query servers, but one option may be to select just the ids, possibly with your most selective filter, and then bulk fetch the documents directly.

With RxJava you can then sort the items as they come in asynchronously from the nodes that have the documents. This reduces the transport cost, since you’d only need to move the full document across the network once. It does increase the time and space at the application side though. Since we have a streaming parser for queries and you can apply the filter as documents come in, additional space usage shouldn’t be too bad.

A simple Java example of this using travel-sample is:

    private static List<JsonObject> queryWithBulkFetch(Bucket bucket, int limit) {
        String bucketName = "`" + bucket.name() + "`";
        String q = "SELECT meta().id as id FROM " + bucketName + " LIMIT " + limit;
        return bucket.async().query(N1qlQuery.simple(q))
            .doOnNext(res -> res.info().map(N1qlMetrics::elapsedTime).forEach(t -> System.out.println("N1QL Took: " + t)))
            .flatMap(AsyncN1qlQueryResult::rows)
            .flatMap(row -> bucket.async().get(row.value().getString("id")))
            .map(JsonDocument::content)
            .toList()
            .toBlocking()
            .single();
    }

Other SDKs have similar functionality.

In your case, you’ll have some other filters and possibly a transformation of the document before that map() call. If you decide to try this, I’d be interested in how it works out!

There’a a full program that @daschl put together here:

Thanks @ingenthr. I think the majority of the time for this application the query filters will require that the entire document be retrieved in order for the n1ql query service to do the comparisons - would it still be the case that grabbing the doc ID then retrieving from the bucket would be better?

That’s a difficult question to answer. A lot of it comes down to what kind of cpu/memory/network resources you have at the query nodes and app servers and how efficiently we make use of those resources.

With some recent experience on something based on the sample I pointed to, we’ve seen environments where the overall fetch is only milliseconds longer than the query time itself. We’ve also seen the same code run in other environments where it takes much longer to complete the overall work, but is still faster than doing it all at the query service.

Since you need to evaluate the whole document after applying whatever selective query you’re doing (i.e., you don’t have a selective enough index and can’t make good use of covering indexes), my suspicion is that this technique will help.

Hi @jberglund,

To your original question. If you are getting (almost) all the fields from a (large) document, the following two should be the most efficient:

SELECT d FROM mybucket d …;

SELECT * FROM mybucket d …;

There are two possible issues with selecting a large number of fields: (1) If you are not using prepared statements, there are more expressions to parse / process / plan. (2) During projection, there is a bit of extra work to process each field, and we are allocating new objects to hold the subset of fields that you are interested in, as opposed to using the current data object.

The difference with RDBMS is a longer conversation; one issue is that we are schemaless, so the execution pipeline is more dynamic / flexible, and there is some cost.

For ordering, you should try using an index with 4.5. If the index supports your query order, N1QL will skip the sort altogether, and you will see major speedup. There are some interesting blogs on dzone by @keshav_m .

Finally, with the upcoming 4.5.1, we are also improving fetch performance, so you will not need to worry about fetching and processing data separately in your application, and can just use N1QL.

If you want to post a sample query and index and Couchbase version, we can take a further look.

@ingenthr

I think one issue with the post-n1ql-retrieval of the documents is that, when async, the order is no longer guaranteed to be the same from the ORDER BY in the N1QL query. When talking about hundreds of thousands of smallish (1kb) documents, I don’t want to keep it in memory to order them. I suppose I could make it all blocking but I think that would kill performance.

@geraldss
Couchbase Version: 4.1.0-5005 Enterprise Edition (build-5005), 3 node cluster
1.6 million docs using 6 out of 18gb in the bucket.

I’m hitting the right index based on the EXPLAIN plan (my index is on a string field and a numeric time field). About 30-40 fields in the document, not a huge document.

SELECT * FROM bwecl AS ecl WHERE serviceProviderUid = ‘SP2200000000’ AND startTime BETWEEN 1469836800000 AND 1469948720839 ORDER BY startTime DESC;
VS
SELECT serviceProviderId, releaseTime, groupId, callingAssertedNum, primaryExtension, type, redirectingReason, callAuthCode, callingPresentationInd, redirectingNumber, detachedAnswerTime, svcInvocNetworkType, connectedName, callLogType, logId, startTime, primaryExtensionNumeric, networkType, basicCallType, connectedNumberSource, callId, callingPresentationNumRestrict, subscriberType, docType, networkTranslatedAddress, svcInvocBasicCallType, primaryPhoneNumber, detachedTime, userId, svcInvocNetXlaAddress, serviceProviderUid, deleted, redirectingName, redirectingPresInd, callingPresentationName (you get the idea…)
BETWEEN 1469836800000 AND 1469948720839 ORDER BY startTime DESC;
{ "serviceProviderId": "serviceProviderGoesHere", "releaseTime": 1469455695000, "groupId": "groupGoesHere", "callingAssertedNum": "+19726966650", "primaryExtension": "600", "type": "eclCallLog", "redirectingReason": "hunt-group", "callAuthCode": "4321", "callingPresentationInd": 1, "redirectingNumber": "+19726966670", "detachedAnswerTime": 1469455706000, "svcInvocNetworkType": 1, "connectedName": "user600@domain", "callLogType": 2, "logId": "++ecl++U22000000020", "startTime": 1469455653000, "primaryExtensionNumeric": 600, "networkType": 1, "basicCallType": 2, "connectedNumberSource": 4, "callId": "71e5618d-bafe-46d1-8e66-65fc25f91ce2", "callingPresentationNumRestrict": false, "subscriberType": 5, "docType": "eclCallLog", "networkTranslatedAddress": "+19726966600", "svcInvocBasicCallType": 2, "primaryPhoneNumber": "+19726966600", "detachedTime": 1469455703000, "userId": "user600@jberglund.rtx.broadsoft.com", "svcInvocNetXlaAddress": "+19726966600", "serviceProviderUid": "SP2200000000", "deleted": false, "redirectingName": "user670@domain", "redirectingPresInd": 1, "callingPresentationName": "user650@domain", "svcInvocDisposition": 19, "authAcctCode": "1234", "svcInvocCallCategory": 1, "callingGroupId": "North_as66", "callingPresentationNumSource": 4, "docTypeVersion": 1, "connectedPresentationInd": 1, "groupUid": "G2200000000", "userUid": "U2200000002", "callingPresentationNum": "+19726966650", "connectedNumber": "+19726966600" }

EDIT: forgot to mention, most situations will probably not have only two filters, our application supports searching on about 11 of them.

Ok, a few suggestions.

(1) Create your index on ( serviceProviderUid, -startTime ). Note the negation of startTime, because you want descending order. We will adding better support for descending order.

(2) Set your WHERE clause on serviceProviderUid and the negation of startTime. So -startTime BETWEEN -x AND -y. Again, we will make this negation go away soon.

(3) Set your ORDER BY on ( serviceProviderUid, -startTime ASC ). You need serviceProviderUid so that it uses index order. (We should recognize this automatically, since there is an equality predicate on serviceProviderUid.) Note the negation of startTime, and the use of ASC instead of DESC. This gives you the sort order you want.

Finally, I would suggest you try this on both 4.1 and 4.5, and tell us the performance comparison. These changes will swamp any difference in your SELECT clause.

@geraldss

Thanks, I will try these suggestions. On item 3) I did some tests before on including the full index on the ORDER BY clause, I didn’t see much of a difference (though perhaps thats because im DESC rather than the natural order?)

Yes, you ran into the DESC issue. (3) will serve you well :slight_smile:

If the order can’t be determined by the index, then something will need to receive all of the data and sort it either on disk or in memory. It looks like @geraldss has a suggestion on how to define an index that supports the order you want.

It is true that if you’re reacting to the rows as they come into the program from the streaming parser, you won’t necessarily have them all in memory at once, but if it’s sorted in memory at the cluster side and then shipped all at once, you’ll probably have most of it in memory at the application client.

Is there a JIRA for that issue?

JIRA tickets:

https://issues.couchbase.com/browse/MB-20420
https://issues.couchbase.com/browse/MB-19917
https://issues.couchbase.com/browse/MB-9244

@geraldss

I’ve tested on 4.1.0 and see no real difference in response times.

Indexes:
Definition: CREATE INDEX bwidx3_bwecl_sp_startTime ON bwecl(serviceProviderUid,startTime) USING GSI
Definition: CREATE INDEX bwidx_bwecl_sp_rev_startTime ON bwecl(serviceProviderUid,(-startTime)) USING GSI

Performing query: SELECT * FROM bwecl WHERE serviceProviderUid = ‘SP2200000000’ AND startTime BETWEEN 1420070400000 AND 1483228800000 ORDER BY serviceProviderUid, startTime DESC;
EXPLAIN PLAN INDEX: bwidx3_bwecl_sp_startTime
ElapsedTime: 87s
51201 rows

Performing query: SELECT * FROM bwecl WHERE serviceProviderUid = ‘SP2200000000’ AND -startTime BETWEEN -1483228800000 AND -1420070400000 ORDER BY serviceProviderUid, -startTime ASC;
EXPLAIN PLAN INDEX: bwidx_bwecl_sp_rev_startTime
ElapsedTime: 86s
51201 rows

Can you post the full EXPLAIN output for the second query / index (the rev).

{ "requestID": "1bed9c52-8b79-4555-8201-2dea969ed3f9", "signature": "json", "results": [ { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "IndexScan", "index": "bwidx_bwecl_sp_rev_startTime", "keyspace": "bwecl", "namespace": "default", "spans": [{"Range": { "High": [ "\"SP2200000000\"", "-1420070400000" ], "Inclusion": 3, "Low": [ "\"SP2200000000\"", "-1483228800000" ] }}], "using": "gsi" }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Fetch", "as": "ecl", "keyspace": "bwecl", "namespace": "default" }, { "#operator": "Filter", "condition": "(((ecl.serviceProviderUid) = \"SP2200000000\") and ((-(ecl.startTime)) between (-1483228800000) and (-1420070400000)))" }, { "#operator": "InitialProject", "result_terms": [ { "expr": "self", "star": true }] } ] } } ] }, { "#operator": "Order", "sort_terms": [ {"expr": "(ecl.serviceProviderUid)"}, {"expr": "(-(ecl.startTime))"} ] }, {"#operator": "FinalProject"} ] }], "status": "success", "metrics": { "elapsedTime": "18.73612ms", "executionTime": "18.356443ms", "resultCount": 1, "resultSize": 3130 } }

edit: was messing with the ORDER BY clause and let out the serviceProviderUid
Performing query: SELECT * FROM bwecl AS ecl WHERE serviceProviderUid = ‘SP2200000000’ AND startTime BETWEEN 1420070400000 AND 1483228800000 ORDER BY serviceProviderUid, startTime DESC;
ElapsedTime: 87s
Rows: 51201
Performing query: SELECT * FROM bwecl AS ecl WHERE serviceProviderUid = ‘SP2200000000’ AND -startTime BETWEEN -1483228800000 AND -1420070400000 ORDER BY serviceProviderUid, -startTime ASC;
ElapsedTime: 86s
Rows: 51201

Ok. The index scan looks fine.

(1) The next thing would be to try to use a covering index. Can you select a specific set of fields, and add those to the end of the index?

(2) After that, the next thing would be to query for META().id only, and then fetch those documents using the key-value API.

We are making improvements to the document fetch in 4.5.1.

Can you try (1) above and tell us how that performs.

Thanks,
Gerald

@geraldss

Unfortunately the nature of our application (upgrading from an old Oracle DB schema) is to support the query with about 11 filters, 9 of which don’t have to be included, so i think are not eligible for the index. As I understand it, I have to include some value or range for all parts of the index for it to be used. Also, I’m going to end up needing almost the entire document for output.

edit: yes, grabbing the meta.id() is about 50% faster in this lab im testing on, but i can’t do a cover query.

Ok. More improvements on the way.

Interesting topic. @jberglund Are there any new insights in the meantime?

Yes, @jberglund @synesty you can try 4.5.1 and 4.6.