Count/Aggregation

Full Text Search seems like a really neat product, and very performant compared to Analytics.

Is there a way to do basic aggregation of search matches? Meaning …

Show me the average, max, and count, for example, of items/docs with a certain keyword.
I get a match of 2 million docs in around 4 seconds … but I really want (as well) the capability for basic aggregated numbers.

e.g. sum of all sales where product_type='phone' in the last 24 hours.

@johnfak Couchbase’s full text search will let you know the number of documents that match your search term/key, in other words the count. We also offer a way to do “faceting” to generate counts based on categories.

For aggregation, I would recommend picking up Couchbase's SQL-based query offering (N1QL) in the current version.
In the upcoming release we have added support for using full text search indexes with the query service (which previously only used primary and secondary indexes), so you will be able to perform complex joins and aggregations using a mix of indexes.

Thanks.
Problem is it takes 40+ seconds in N1QL and Analytics versus 6 seconds in open-source Redis.
N1QL scales terribly.

Search seems much better … but without that aggregation … it leaves us back at square one :wink: So to speak.
Product enhancement request?

40+ seconds in N1QL

Did you have a secondary index defined to support the query?

Yes Secondary index.
However, it's an aggregation over 90% of 45 million JSON docs.

INDEX
CREATE INDEX typeagg ON bucket(type) WITH {"num_replica":1}

QUERY
select type, count(1) from ice_us where type is not missing group by type

For what you're trying to achieve with this query, there's a way with FTS. Here's a faceted full text search request over a default dynamic index, or over an index where you've indexed the field "type" …

curl -XPOST -H "Content-Type: application/json" \
http://Administrator:asdasd@127.0.0.1:8094/api/index/default/query -d '
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_tokens": {
      "field": "type",
      "size": 10
    }
  }
}
'

This would limit the number of facets generated to the top 10 (descending order of counts).
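The "size"/"other" mechanics can be sketched in Python. This is an illustrative model of how a facet result is assembled from per-term counts, not the server's actual code, and the term counts here are made up:

```python
from collections import Counter

def build_facet(term_counts, size):
    """Assemble an FTS-style facet result: keep the top `size` terms
    (descending by count) and roll everything else into `other`."""
    ranked = Counter(term_counts).most_common()  # sorted descending by count
    return {
        "total": sum(term_counts.values()),          # docs that had the field
        "other": sum(c for _, c in ranked[size:]),   # counts beyond the top N
        "terms": [{"term": t, "count": c} for t, c in ranked[:size]],
    }

# Hypothetical counts, purely for illustration
counts = {"phone": 500, "products": 900, "customer": 300, "email": 40, "cn": 5}
facet = build_facet(counts, size=3)
print(facet["terms"][0])  # highest-count term comes first
print(facet["other"])     # the two dropped terms: 40 + 5 = 45
```

The key takeaway is that terms past the cut-off aren't lost; their counts are summed into the `other` field of the response.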

FYI: if doing N1QL aggregation you should consider the EE version with index aggregation. https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/

Also check out https://blog.couchbase.com/n1ql-and-search-how-to-leverage-fts-index-in-n1ql-query/

We are using EE.
We were told by support that it's a lot of data, and it is what it is.

Thanks I’ll give it a go and repost results …

{"error":"rest_auth: preparePerms, err: index not found","request":{"facets":{"type_tokens":{"field":"type","size":10}},"query":{"match_all":{}}},"status":"fail"}

Hi @johnfak,

Could it be a copy-paste error? Did you replace the index name in the curl cmd with your actual FTS index name?

FTS aggregation ends up fetching columnar data from disk while serving faceted queries, and the intuition is that it may not keep latency low enough for ~40M docs. Awaiting your results …

But if not, have you also explored the in-memory covered secondary indexes?

Cheers!

I don't think it's a copy-and-paste error, because I also tried the JSON curl directly from the UI … and that failed too …
But it might be something with auth or the logic as written, so still digging.

My understanding from re-reading is that MOI is only at the entire-cluster level, not the individual-index level?

Dang … quick.

-bash-4.2$ time curl -u Administrator:********* -XPOST -H "Content-Type: application/json" http://localhost:8094/api/index/faksearch2/query -d '
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_tokens": {
      "field": "type",
      "size": 10
    }
  }
}'
real 0m0.430s
user 0m0.002s
sys 0m0.004s



{
  "total_hits": 2084816,
  "max_score": 1,
  "took": 422837925,
  "facets": {
    "type_tokens": {
      "field": "type",
      "total": 1296666,
      "missing": 788150,
      "other": 9,
      "terms": [
        { "term": "products", "count": 617019 },
        { "term": "phone", "count": 309668 },
        { "term": "customer", "count": 143777 },
        { "term": "addressbooklist", "count": 48893 },
        { "term": "email", "count": 45561 },
        { "term": "addressbooklookup", "count": 42036 },
        { "term": "addressbook", "count": 41891 },
        { "term": "custguid", "count": 20587 },
        { "term": "cn", "count": 15694 },
        { "term": "customerstatus", "count": 11531 }
      ]
    }
  }
}

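For what it's worth, the facet fields in that response do reconcile internally. A quick Python check using the numbers above (docs with a "type" field plus docs missing it account for every hit, and the top-10 term counts plus "other" account for every fielded doc):

```python
# Numbers copied from the facet response above
total_hits = 2084816
total, missing, other = 1296666, 788150, 9
term_counts = [617019, 309668, 143777, 48893, 45561,
               42036, 41891, 20587, 15694, 11531]

# total + missing should account for every matched document
assert total + missing == total_hits

# the top-10 terms + the "other" bucket should account for every fielded doc
assert sum(term_counts) + other == total

print("facet response is internally consistent")
```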
Only thing is the counts don't match. "products" above returns 617019

whereas

select * from (select type, count(1) from ice_us where type is not missing group by type order by 2 desc ) d limit 10
returns

[
  { "d": { "$1": 4056453, "type": "PRODUCTS" } },
  { "d": { "$1": 892754, "type": "addressbook" } },
  { "d": { "$1": 740859, "type": "addressbooklist" } },
  { "d": { "$1": 894095, "type": "addressbooklookup" } },
  { "d": { "$1": 3196, "type": "agentldap" } },
  { "d": { "$1": 15, "type": "avtaddrbkrel" } },
  { "d": { "$1": 2, "type": "avtcountry" } },
  { "d": { "$1": 4, "type": "avtcustaddrbk" } },
  { "d": { "$1": 1, "type": "avtcustnicknamedisp" } },
  { "d": { "$1": 24, "type": "avtcustomerstatus" } }
]
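One cosmetic difference worth noting: FTS's standard analyzer lowercases tokens before indexing, which is why N1QL's "PRODUCTS" appears as the facet term "products". Casing aside, the counts themselves still diverge badly. A quick comparison in Python (counts copied from the two result sets above) shows every facet count is well below its N1QL counterpart, which lines up with the incomplete index build noted further down the thread:

```python
# Counts copied from the FTS facet response and the N1QL result above
fts_facets = {"products": 617019, "addressbooklist": 48893,
              "addressbooklookup": 42036, "addressbook": 41891}
n1ql_counts = {"PRODUCTS": 4056453, "addressbook": 892754,
               "addressbooklist": 740859, "addressbooklookup": 894095}

# Normalize casing the way a lowercasing analyzer would
n1ql_lower = {t.lower(): c for t, c in n1ql_counts.items()}

for term, fts_count in fts_facets.items():
    print(term, fts_count, n1ql_lower[term], sep="\t")
# Every FTS count is far below its N1QL counterpart, consistent with
# the FTS index still being partially built.
```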

Yes you were correct - I missed that - thanks

Counts might not match because the index build isn't at 100% … it seems slow … I had assumed it had finished since it's just a single attribute (type) … so I dropped it and am now recreating.

That's true, it's a cluster-level feature!

Yeah, that would be hard to do on an existing production cluster … also, we're really just trying to satisfy 1-2 queries doing analysis/aggregation across the data … so we don't want all indexes fully occupied in memory.

I'll retry search once the index build completes … not sure why it's so slow compared to a normal N1QL index build.
CPU cores are only at around 30% across the board … but the top process is cbft.

01:09:04 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:09:09 PM all 28.85 0.00 3.80 0.25 0.00 0.15 0.00 0.00 0.00 66.94
01:09:09 PM 0 32.66 0.00 3.04 0.41 0.00 0.20 0.00 0.00 0.00 63.69
01:09:09 PM 1 27.46 0.00 3.28 0.00 0.00 0.20 0.00 0.00 0.00 69.06
01:09:09 PM 2 29.35 0.00 4.25 0.40 0.00 0.20 0.00 0.00 0.00 65.79
01:09:09 PM 3 26.01 0.00 4.44 0.00 0.00 0.20 0.00 0.00 0.00 69.35

top - 13:07:45 up 1 day, 18:12, 1 user, load average: 4.98, 3.82, 3.51
Tasks: 227 total, 1 running, 226 sleeping, 0 stopped, 0 zombie
%Cpu(s): 38.5 us, 5.9 sy, 0.0 ni, 55.1 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 16266540 total, 2174136 free, 8882112 used, 5210292 buff/cache
KiB Swap: 3903484 total, 3859700 free, 43784 used. 7029912 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20942 couchba+ 20 0 7040148 1.6g 46148 S 99.0 10.3 100:06.93 cbft
7370 couchba+ 20 0 6790280 3.9g 4636 S 30.2 25.0 452:54.22 memcached
3456 couchba+ 20 0 4271824 892216 3444 S 29.2 5.5 1709:36 beam.smp
20868 couchba+ 20 0 2920604 524168 11756 S 12.3 3.2 33:42.22 indexer
2250 root 20 0 555304 18244 7168 S 2.3 0.1 51:44.11 BESClient
20760 couchba+ 20 0 828764 162036 23668 S 1.3 1.0 1:22.55 cbq-engine
20889 couchba+ 20 0 1073592 403788 6380 S 1.3 2.5 3:52.11 projector
1871 couchba+ 20 0 3318164 55916 3160 S 0.7 0.3 377:01.21 beam.smp

@johnfak Glad the faceted full text search request did the trick for you. Regarding slow-building FTS indexes: how have you defined your FTS index? We could try fine-tuning it if you haven't already. Would you share your index definition here?

{
  "type": "fulltext-index",
  "name": "faksearch2",
  "uuid": "692cd72a277a620d",
  "sourceType": "couchbase",
  "sourceName": "ice_us",
  "sourceUUID": "021160cf87998bf9e4dc96303a90a13d",
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "indexPartitions": 6
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

Ok, so it looks like you're creating a default dynamic index, meaning you're indexing everything: all fields within all documents. Since the "type" field is the only one needed to execute the faceted query, let's change the index definition to index only the "type" field in the documents (where available). Here's how your index definition would look afterwards; I expect this to build a whole lot faster.

{
  "name": "faksearch2",
  "type": "fulltext-index",
  "uuid": "",
  "sourceType": "couchbase",
  "sourceName": "ice_us",
  "sourceUUID": "021160cf87998bf9e4dc96303a90a13d",
  "sourceParams": {},
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "numReplicas": 0,
    "indexPartitions": 6
  },
  "params": {
    "mapping": {
      "default_mapping": {
        "enabled": true,
        "dynamic": false,
        "properties": {
          "type": {
            "enabled": true,
            "dynamic": false,
            "fields": [
              {
                "name": "type",
                "type": "text",
                "store": false,
                "index": true,
                "include_term_vectors": true,
                "include_in_all": true,
                "docvalues": true
              }
            ]
          }
        }
      },
      "default_type": "_default",
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "store_dynamic": false,
      "index_dynamic": true
    },
    "store": {
      "indexType": "scorch"
    },
    "doc_config": {
      "mode": "type_field",
      "type_field": "type",
      "docid_prefix_delim": "",
      "docid_regexp": ""
    }
  }
}