GROUP BY performance

Hi,

We’d like to pick the community’s brain on the optimal way to implement N1QL GROUP BY query on a Full Ejection bucket with 10 million documents.

Here is an example of what we are trying to do:

SELECT name, SUM(requests)
FROM Requests
WHERE date LIKE "%06/07/2017%"
AND
id IS NOT MISSING
GROUP BY name;

We created both primary index and secondary index on the bucket. Yet, the query doesn’t return (runs endlessly).

Any suggestions on how to design this whole thing properly?

Could you please post the sample document and index definition.

NOTE: Predicate date LIKE “%06/07/2017%” AND id IS NOT MISSING requires full index scan on the whole index (due to leading %). i.e. 10 million documents. On top it needs to GROUP BY.

Thank you!

I’ll post an example doc later today. Meanwhile, if we drop % and IS
MISSING, how much faster you think the execution will be? Also, will
primary index be enough, or do we need to create a specific secondary index?

Alex

It depends. The reason I asked sample document is
want date field format in the document if it is ISO 8601 timestamp it is much easier to query(i.e. “2017-07-06T10:02:46.808-07:00”). You need leading % and can push partial string to the Indexer and produce less items.

Is date format is same in all the documents.

Also requests (i.e sum argument) is not bucket name Requests
Based on that I will suggest index so that avoid fetch.

NOTE: provide sample document that fields referenced in query only

Hi,

Requests is the name of the bucket, and requests is the field of the
document.

Here is how our date field of the corresponding document looks like:
“date”: "06/07/2017 22-56-04.648"
With this formatting its simply easier to read the date and time, comparing
to the ISO 8601 timestamp format. If needed, we can definitely change it to
whatever will make CouchBase’s life easier.

Here is the (shortened) example structure of the document:

{
“date”: “06/07/2017 22-56-04.648”,
“country”: “…”,
“zipCode”: “…”,
“id”: “…”,
“name”: “…”,
“requests”: 1
}

So the question is - what index should we prepare in order to run
previously mentioned N1QL query?

Alex

Hi @alex1,
In JSON there is no date/datetime/timestamp data type. If the date is stored as string in ISO 8601 timestamp format it can string comparable and do even range queries. The format you have can’t do range queries can do only equality/like queries. Also N1QL provides many DATE functions all works on ISO 8601 format.
https://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/datefun.html

For your query create the following index. It uses covered index (i.e no data Fetch).

CREATE INDEX ix1 ON Requests(date, id, name, requests);
SELECT name, SUM(requests)
     FROM Requests 
    WHERE date LIKE "06/07/2017%" AND id IS NOT MISSING
    GROUP BY name;

If the date is ISO 8601 format you can even do this

Only 2016 and 2017 first 6 months data

SELECT name, SUM(requests)
      FROM Requests
      WHERE date BETWEEN  "2016-01-01" AND "2017-06-30" AND id IS NOT MISSING 
      GROUP BY name;

Only 2017 and group BY month

SELECT SUBSTR(date,5,2) AS month, SUM(requests)
         FROM Requests
         WHERE date BETWEEN  "2017-01-01" AND "2017-12-31" AND id IS NOT MISSING 
         GROUP BY SUBSTR(date,5,2);

SELECT DATE_PART_STR(date,"month") AS month, SUM(requests)
          FROM Requests
          WHERE date BETWEEN  "2017-01-01" AND "2017-12-31" AND id IS NOT MISSING
          GROUP BY DATE_PART_STR(date,"month");

In ISO 8601 format if format is differ also comparable (i.e “2017-07-06T20:43:10.033-07:00” > “2017-07-06”)

You can also set max_parallelism parameter to make it parallel.

Following N1QL: A PRACTICAL GUIDE will have useful information.

Thank you so much!

We changed the date format to match ISO 8601, queries that you proposed return the data that we were hoping to see!

By the way - it looks like CB 4.5 doesn’t follow N1QL BETWEEN specification - results that are being returned by the querying service don’t include the ending date.

It should have followed. Can you post query and sample document
Also try x BETEEN start AND end ====> x >= start AND x <= end and see if you getting results.

{
“date”: “2017-07-12 22-56-04.648”,
“country”: “…”,
“zipCode”: “…”,
“id”: “…”,
“name”: “…”,
“requests”: 1
}

SELECT country, COUNT(*) AS requests
FROM Request_IDs
WHERE date BETWEEN “2017-07-11” AND "2017-07-12"
AND country='United States’
AND id IS NOT MISSING
GROUP BY country;

NOTE : “date”: “2017-07-12 22-56-04.648”, is NOT ISO 8601 format requires “:” between hour, minute,seconds.

date includes milliseconds so WHERE clause becomes date BETWEEN “2017-07-11 00:00:00.000” AND “2017-07-12 00:00:00.000”

Couchbase SDKs Look Note section below Table 3

If the date string does not explicitly declare the value of a component, then a value of 0 is assumed. For example 2016-02-07 is equivalent to 2016-02-07T00:00:00. This is with the exception of the time zone, which if unspecified in the date string will default to the local system time zone.

Sorry, I didn’t paste the proper date implementation - we do follow the ISO 8601 specification.

For now we are adding 1 more day to the closing date of BETWEEN query - in order to get the data for the day before. We can live with that but it’s worth checking…

Ok.

Above document does not qualify because “2017-07-12 22-56-04.648” BETWEEN “2017-07-11 00:00:00.000” AND “2017-07-12 00:00:00.000” returns false.

If you want to include those change where clause as follows

DATE_FORMAT_STR(date,“1111-11-11”) BETWEEN “2017-07-11” AND “2017-07-12”

Thanks a lot! Will do that!