Count(*) is super slow, and index across multiple document types is not fast enough

flaviu · February 12, 2021, 1:55pm

oh my god, I am going to throw my computer out of the window

I cannot make this work no matter what I do.

I have multiple documents (200 million) of the format: (filtered by app_primary_genre_id = 1 - will result in 200k documents )

{
“country_code”: “us”,
“sub_type”: “app”,
“redacted”: {
“app_name”: “dark war”,
“app_primary_genre_id”: 1,
"app_provider_uuid":  "eecefa65-12bd-45d5-93b8-eef72c83ed24",
"release_date": 1234,
"user_rating": 4.23
}
“key”: “meta:bc9f7bc7-ad0a-4f91-a213-357872d34d8c:is”,
“created_timestamp”: 1602820362,
“type”: “tracker-app”,
“version”: 2,
"app_uuid": "bc9f7bc7-ad0a-4f91-a213-357872d34d8c",
“updated_timestamp”: 1611807733
}

and another (100k) of format:

{
“sub_type”: “app-extra”,
“key”: “meta:bc9f7bc7-ad0a-4f91-a213-357872d34d8c:e:1”,
“created_timestamp”: 1602820362,
“type”: “tracker-app”,
"i_genre_id": "1001",
“version”: 2,
“updated_timestamp”: 1611807733
}

I am trying to make a query to join and filter these two type of documents:

SELECT META(a).id,
   FROM `bucket` a LEFT
        JOIN `bucket` e ON a.app_uuid = e.app_uuid
        AND e.sub_type="app-extra"
    WHERE a.sub_type="app"
        AND a.country_code = "us"
        AND a.redacted.app_primary_genre_id = 1
        AND a.redacted.app_provider_uuid= "eecefa65-12bd-45d5-93b8-eef72c83ed24" /* (this is optional) */
        AND a.redacted.release_date <=  1234  /* (this is optional) */
        AND a.redacted.user_rating <=  3  /* (this is optional) */
        AND a.redacted.app_name LIKE "%dark%"  /* (this is optional) */
        AND e.i_genre_id LIKE "2%"  /* (this is optional) */

I have created all the recommended queries (including the covering ones), the results (8 in total) of this query takes 2 seconds. Which is unacceptably high, but If I do a count(*)

 SELECT count(*)
    FROM `bucket` a LEFT
        JOIN `bucket` e ON a.app_uuid = e.app_uuid
        AND e.sub_type="app-extra"
    WHERE a.sub_type="app"
        AND a.country_code = "us"
        AND a.redacted.app_primary_genre_id = 1
        AND a.redacted.app_provider_uuid= "eecefa65-12bd-45d5-93b8-eef72c83ed24" /* (this is optional) */
        AND a.redacted.release_date <=  1234  /* (this is optional) */
        AND a.redacted.user_rating <=  3  /* (this is optional) */
        AND a.redacted.app_name LIKE "%dark%"  /* (this is optional) */
        AND e.i_genre_id LIKE "2%"  /* (this is optional) */

I get the results in tens of minutes, so, clearly, I am doing something wrong.

I have created so many indexes (lost count), based on your https://index-advisor.couchbase.com/indexadvisor/ but also on the index recommendations in the query result page (“Advice” tab). I don’t know what else to do, this is not working!!

I am trying to use this type of queries on a paginated search results page for a web application. So, I will also need to implement some kind of pagination

I don’t understand what is wrong and how can this be improved, but I feel I am going crazy.

Is it possible to get anybody from this forum to pay to help me with all these queries? (probably will be cheaper than madness medication)

vsr1 · February 12, 2021, 3:16pm

CREATE INDEX ix1 ON `bucket`(country_code, redacted.app_primary_genre_id , redacted.app_provider_uuid,
      redacted.release_date,  redacted.user_rating, redacted.app_name , app_uuid)
WHERE sub_type = "app";
CREATE INDEX ix2 ON `bucket` ( app_uuid, i_genre_id ) WHERE sub_type = "app-extra";

flaviu · February 12, 2021, 5:07pm

Thank you @vsr1

I have created the indexes per your suggestion (ix1111, and ix222):

but when I do a count (*) without filters ( the number of results is around 200k) it stays 10 minutes to do the count

SELECT count(*)
FROM `bucket` a USE INDEX (ix1111 )LEFT
    JOIN `bucket` e USE INDEX (ix2222 ) ON a.app_uuid = e.app_uuid
    AND e.sub_type="app-extra"
WHERE a.sub_type="app"
    AND a.country_code = "us"
    AND a.redacted.app_primary_genre_id = 1

also as I continue to navigateincreasign the OFFSE and keeping same LIMIT the response time is starting to go up with 300 ms for every additional 100 result on the OFFSET

if I try to order based on meta(a).id so that I could do the pagination using where meta(a).id > xxxxxyyyy the ordering takes minutes, so, again this is not a good solution.

What is the recommended solution for this?

I was thinking to put a query timeout on the count(*) query and if the query takes longer than x milliseconds to run just to consider that the number of the result is bigger than let’s say 10k, and as I navigate (through multiple pages) I can just update this total number until the result are smaller than the limit, and that’s how I could know when I am at the end of the list. But in order for this to work, I need to make the pagination to be fast. Do you have any recommendations on how to make the pagination fast?

vsr1 · February 12, 2021, 5:14pm

There is no cursors it needs to restart from beginning. No easy solution Using OFFSET and Keyset in N1QL | The Couchbase Blog

flaviu · February 12, 2021, 6:59pm

This is a very old problem, and all the websites have soeme kind o pagination. You guys really need to nail this down at some point, as this is a BIG - BIG isue for people trying to use Couchbase on web projects…

Topic		Replies	Views
Join query very slow SQL++	4	2423	January 29, 2016
Creating Index when 2 documents having array fields as predicates are joined SQL++ n1ql	6	1185	October 14, 2017
Optimising count query SQL++ query	2	623	May 2, 2019
Multiple Index JOIN from a single document SQL++ n1ql	6	1196	November 30, 2018
Slow index join SQL++ n1ql , index	6	1844	December 19, 2016

Count(*) is super slow, and index across multiple document types is not fast enough

Related topics