Improve count query - Use IndexCountScan

Hi all,

I have count query that I want to improve (currently the response time is ~27s with 3M of records):

SELECT COUNT(DISTINCT `subject`) AS count FROM block WHERE `subject` IS NOT MISSING AND `_class` = "com.xxx.yyy.zzz.model.Block" AND `type` IS NOT MISSING AND `type` IN ['http://my.host.com/xxx/yyy/core']

The explain show a covering index:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IndexScan",
          "covers": [
            "cover ((`block`.`subject`))",
            "cover ((`block`.`type`))",
            "cover ((`block`.`lastModification`))",
            "cover ((meta(`block`).`id`))"
          ],
          "filter_covers": {
            "cover ((`block`.`_class`))": "com.xxx.yyy.zzz.model.Block"
          },
          "index": "i_subject_type_lastModification_class",
          "index_id": "36a2933006736836",
          "keyspace": "block",
          "namespace": "default",
          "spans": [
            {
              "Range": {
                "Inclusion": 0,
                "Low": [
                  "null",
                  "\"http://my.host.com/xxx/yyy/core\""
                ]
              }
            }
          ],
          "using": "gsi"
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Filter",
                "condition": "((((cover ((`block`.`subject`)) is not missing) and (cover ((`block`.`_class`)) = \"com.xxx.yyy.zzz.model.Block\")) and (cover ((`block`.`type`)) is not missing)) and (cover ((`block`.`type`)) in [\"http://my.host.com/xxx/yyy/core\"]))"
              },
              {
                "#operator": "InitialGroup",
                "aggregates": [
                  "count(distinct cover ((`block`.`subject`)))"
                ],
                "group_keys": []
              }
            ]
          }
        },
        {
          "#operator": "IntermediateGroup",
          "aggregates": [
            "count(distinct cover ((`block`.`subject`)))"
          ],
          "group_keys": []
        },
        {
          "#operator": "FinalGroup",
          "aggregates": [
            "count(distinct cover ((`block`.`subject`)))"
          ],
          "group_keys": []
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "InitialProject",
                "result_terms": [
                  {
                    "as": "count",
                    "expr": "count(distinct cover ((`block`.`subject`)))"
                  }
                ]
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        }
      ]
    },
    "text": "\nSELECT \nCOUNT(DISTINCT `subject`) AS count \nFROM block WHERE `subject` IS NOT MISSING AND `_class` = \"com.xxx.yyy.zzz.model.Block\" \nAND `type` IS NOT MISSING AND `type` IN ['http://my.host.com/xxx/yyy/core']"
  }
]

I have the following index:

CREATE INDEXi_subject_type_lastModification_classONblock(subject,type,lastModification) WHERE (_class= "com.xxx.yyy.zzz.model.Block")

The stored documents have the following structure:

{
  "creator": "system",
  "subject": "http://my.host.com/data/subjects/259",
  "_class": "com.xxx.yyy.zzz.model.Block",
  "type": "http://my.host.com/xxx/yyy/core",
  "body": {
    <body-can-have-a-random-json-object>
  },
  "creationDate": 1474400777951
}

I wonder if its possible to use a IndexCountScan in this query. How can I do that?
Thanks!

Hi,

COUNT with DISTINCT does not use IndexCountScan.

You are searching for a single type, so you should use type = … instead of type IN … .

You can also add the type to the index predicate. And you can remove type and lastModification from the index keys.

This query will not able to use IndexCountScan due to DISTINCT.

The following are right index and query.

CREATE INDEX ix1 (subject) WHERE (_class= “com.xxx.yyy.zzz.model.Block” AND type = ‘http://my.host.com/xxx/yyy/core’ )

SELECT COUNT(DISTINCT subject) AS count FROM block WHERE subject IS NOT MISSING AND _class = “com.xxx.yyy.zzz.model.Block” AND type = ‘http://my.host.com/xxx/yyy/core’;

If the type going to change for different query.

CREATE INDEX ix1 (type, subject) WHERE (_class= “com.xxx.yyy.zzz.model.Block” )

SELECT COUNT(DISTINCT subject) AS count FROM block WHERE subject IS NOT MISSING AND _class = “com.xxx.yyy.zzz.model.Block” AND type IN [ ‘http://my.host.com/xxx/yyy/core’];

If there is only a single type, the second query should also use type = … .

@vsr1/@geraldss thanks for the advice.

I didn’t know about the DISTINCT and the IndexCountScan.

The reason I am using IN is because the user can specify multiple types. Maybe I can improve that part based on the number of types the user pass in and generate IN or = based on that.

On the other hand I added the lastModification field to the index key because in a subsequent query I am asking for the lastMofication field:

SELECTsubjectAS subject, MIN(lastModification) AS modified FROM block WHEREsubjectIS NOT MISSING AND_class= "com.xxx.yyy.zzz.model.Block" ANDtypeIS NOT MISSING ANDtypeIN ['http://my.host.com/xxx/yyy/core'] GROUP BYsubjectORDER BYmodifiedASC LIMIT 1000 OFFSET 0

These two queries are part of a same request. The count query is because is a paginated request.
Thanks!

Makes sense. If you post all the related queries, we can propose a set of indexes.

1 Like

Hi @geraldss , thanks for your help

I have 2 types of request, that end up executing 2 queries each in CB (a count and a select - since the request is paginated).

REQUEST TYPE 1.-

In this request type the type field, subject and lastModification are optional, so the user can specify any of those 3 filters.

Since we are using spring data, _class field is always included (but we can remove it from the query if that improves the performance)

count:

SELECT COUNT(*) AS count 
FROM `block` 
WHERE 1 = 1 
AND `_class` = "com.xxx.yyy.zzz.model.Block"

AND `type` IS NOT MISSING 
AND `type` = "http://my.host.com/types/type1" // can also be `type` IN [...] depennding on the number of types the user provide we use one or another.

AND `subject` IS NOT MISSING 
AND `subject` = "http://my.host.com/data/subjects/subject1" 

AND `lastModification` >= 1414551076000 

query:

SELECT `type`, `subject`, `lastModification` AS modified, `body` 
FROM `block` 
WHERE 1 = 1 
AND `_class` = "com.xxx.yyy.zzz.model.Block"

AND `type` IS NOT MISSING 
AND `type` = "http://my.host.com/types/type1" // can also be `type` IN [...] depennding on the number of types the user provide we use one or another.

AND `subject` IS NOT MISSING 
AND `subject` = "http://my.host.com/data/subjects/subject1"

AND `lastModification` >= 1414551076000 
 
ORDER BY `lastModification` ASC, `subject` ASC LIMIT 1000 OFFSET 0

For this type of request, the index I have created is:

CREATE INDEX `i_subject_type_lastModification_class` 
ON`block`(`subject`,`type`,`lastModification`) 
WHERE (`_class` = "com.xxx.yyy.zzz.model.Block")

REQUEST TYPE 2.-

On this other type of request, the response is different so we only need the subject and the MIN(lastModification). Also here, the user can specify 2 other filters, one is the search by subject LIKE, and the second one is by fields in the body.
I am not worried about the performance of the last two filters since I know that they can be pretty expensive and the use user is not likely to search using those filters. It doesn’t make sense to index by the body field since that field can contain a random JSON object.
I was thinking that I can maybe use subqueries in these cases.???

What I would like to improve is the query by type and lastModification. In this case the user can specify only the type, only the lastModification or both.

A question that I have related with the lastModification field (Date) is what is the best data type that I should use for that field. In my case we are storing the Date as a long but went we search by lastModification, the query is not performing as I expected, even though the explain shows a covered index.

count:

SELECT COUNT(DISTINCT `subject`) AS count 
FROM block 
WHERE 1 = 1 
AND `_class` = "com.xxx.yyy.zzz.model.Block" 

AND `type` IS NOT MISSING 
AND `type` = "http://my.host.com/types/type1" // can also be `type` IN [...] depennding on the number of types the user provide we use one or another.

AND `lastModification` IS NOT MISSING
AND `lastModification` >= 1414551076000 

AND `subject` IS NOT MISSING 
AND `subject` LIKE "%subject1%"

AND `body`.`fieldXYZ` = "value123"

query:

SELECT `subject` AS subject, MIN(`lastModification`) AS modified 
FROM block 
WHERE 1 = 1
AND `_class` = "com.xxx.yyy.zzz.model.Block"

AND `type` IS NOT MISSING 
AND `type` = "http://my.host.com/types/type1" // can also be `type` IN [...] depennding on the number of types the user provide we use one or another.

AND `lastModification` IS NOT MISSING 
AND `lastModification` >= 1414551076000 

AND `subject` IS NOT MISSING 
AND `subject` LIKE "%subject1%"

AND `body`.`fieldXYZ` = "value123" 

GROUP BY `subject` ORDER BY `modified` ASC LIMIT 1000 OFFSET 0

For this type of request the Index I have created is:

CREATE INDEX `i_subject_lastModification_class` 
ON `block`(`subject`,`lastModification`) 
WHERE (`_class` = "com.xxx.yyy.zzz.model.Block")

The stored documents have the following structure:

{
  "_class": "com.xxx.yyy.zzz.model.Block",
  "creator": "system",
  "creationDate": 1474400777951,
  "lastModifiedBy" : "system",
  "lastModification" : 1474400777951,
  "subject": "http://my.host.com/data/subjects/subject1",
  "type": "http://my.host.com/xxx/yyy/core",
  "body": {
    <body-can-have-a-random-json-object>
  }
}

If you can recommend a set of indexes for those 2 types of queries or maybe how to re-write them in a better way, it will be fantastic!

Thanks!.
M.-

@geraldss do you have solved the IndexCountScan like problem in version 5.1.

There are improvements coming in 5.5.0 DP released today and you should try and let us know.

thanks, I will tell you the result.

Hi @ex.zhengshaoqiong,

5.5.0 DP available to download. Check Index Grouping and Aggregations

@vsr1
when where condition is like %钢构厂房_32%,i also can’t use indexCountScan in version 5.5

{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “IndexScan3”,
“covers”: [
“cover ((business.cate_id))”,
“cover ((business.area_id))”,
“cover ((business.tag))”,
“cover ((business.-vip))”,
“cover ((business.-id))”,
“cover ((meta(business).id))”
],
“index”: “idx_cate_area_tag_id_vip”,
“index_id”: “bb5852e9b9e1f104”,
“index_projection”: {
“entry_keys”: [
0,
1,
2
]
},
“keyspace”: “business”,
“namespace”: “default”,
“spans”: [
{
“range”: [
{
“high”: “2”,
“inclusion”: 3,
“low”: “2”
},
{
“high”: “109”,
“inclusion”: 3,
“low”: “109”
},
{
“high”: “[]”,
“inclusion”: 1,
“low”: “”""
}
]
}
],
“using”: “gsi”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Filter”,
“condition”: “((((1 = 1) and (cover ((business.area_id)) = 109)) and (cover ((business.cate_id)) = 2)) and (cover ((business.tag)) like “%钢构厂房_32%”))”
},
{
"#operator": “InitialGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
}
]
}
},
{
"#operator": “IntermediateGroup”,
“aggregates”: [
"count(
)”
],
“group_keys”: []
},
{
"#operator": “FinalGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “InitialProject”,
“result_terms”: [
{
“expr”: "count(
)”
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
“text”: “select count(*) from business where 1=1 and area_id = 109 and cate_id = 2 and tag like ‘%钢构厂房_32%’;”
}

IndexCountScan or any pushdown needs to happen the all the predicates must able to converted to exact range scans and passed to indexer (Query should not be able eliminate any rows further). Otherwise results will not correct. In your case Leading % can’t be converted to range scan. So it will not do IndexCountScan.