Count query is very slow and returns 0 some times even with indexes and IndexCountScan

We see the count query is very slow (usually 10 seconds) and returning 0 result some times. This is causing both performance issues and inconsistent results in our application.

Couchbase version : Community Edition 6.0.0 build 1693
Node configuration: 7 nodes, 16 core CPU , 16GB memory , SSD.
Number of documents in the bucket at any time: 25K
There are lot of documents gets deleted and created every second.

Indexes:
CREATE INDEX index1 ON bucket1(_type,time,intType,(meta().cas)) WHERE ((state = “QUEUED”) and (intType = “ASYNC”)) WITH { “nodes”:[ “cache01.xxx.net:8091” ] }
CREATE INDEX index2 ON bucket1(_type,time,intType,(meta().cas)) WHERE ((state = “QUEUED”) and (intType = “ASYNC”)) WITH { “nodes”:[ “cache02.xxx.net:8091” ] }

Query:
SELECT COUNT(*) AS count FROM bucket1 WHERE _type =“queue-2” and state=“QUEUED” and intType = “ASYNC”

Explain:
{
“plan”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “IndexCountScan2”,
“covers”: [
“cover ((bucket1._type))”,
“cover ((bucket1.time))”,
“cover ((bucket1.intType))”,
“cover ((meta(bucket1).cas))”,
“cover ((meta(bucket1).id))”
],
“filter_covers”: {
“cover ((bucket1.intType))”: “ASYNC”,
“cover ((bucket1.state))”: “QUEUED”
},
“index”: “type_time_chattype_3”,
“index_id”: “52b43bbcabe21bf5”,
“keyspace”: “bucket1”,
“namespace”: “default”,
“spans”: [
{
“exact”: true,
“range”: [
{
“high”: "“queue-2"”,
“inclusion”: 3,
“low”: "“queue-2"”
}
]
}
],
“using”: “gsi”
},
{
#operator”: “IndexCountProject”,
“result_terms”: [
{
“as”: “count”,
“expr”: “count()"
}
]
}
]
},
“text”: "SELECT COUNT(
) AS count FROM bucket1 WHERE _type =“queue-2” and state=“QUEUED” and intType = “ASYNC””
}

Observations:

  • Count query returns 0 some times (for a period of 15 minutes continously)
  • Count query takes more than 10 seconds most of the time
  • index data size (upto 80 MB) and index disk size (upto 20 MB) gradually increases over a period of time
  • index average item size also increases over a period of time (from 2K it went upto 20K)
  • Nodes where this index is present the CPU utilization is gradually increases and finally reaching 100% causing the node to not respond properly.
  • SInce we use cicrular write index, at zero UTC compaction happens which brings everything under control (CPU, index data and disk size and average item size)
  • Manually triggering compaction did not help
  • When the CPU is at 100%, dropping the index and re-creating them again bring back the node to normal state (but there is an impact in the application due to that for some time).

Guidance here is much appreciated and seeking for couchbase expertise here. I would be happy to share more details if needed.

The plan looks right. The count is performed by indexer and query service just report final count. cc @varun.velamuri