Count/Aggregation

Full Text Search seems like a really neat product, and very performant compared to Analytics.

Is there a way to do basic aggregation of search matches? Meaning …

Show me the average, max, and count, for example, of items/docs with a certain keyword.
I get a match of 2 million docs in around 4 seconds … but I really want (as well) the capability for basic aggregated numbers.

e.g. sum of all sales where product_type='phone' in the last 24 hours.

@johnfak Couchbase’s full text search will let you know the number of documents that match your search term/key, in other words the count. We also offer a way to do “faceting” to generate counts based on categories.

For aggregation, I would recommend picking up Couchbase's SQL-based query offering (N1QL) in the current version.
In the upcoming release we have added support for using full text search indexes with the query service (which previously only used primary and secondary indexes), so you will be able to perform complex joins and aggregations using a mix of indexes.

Thanks.
Problem is it takes 40+ seconds in N1QL and Analytics versus 6 seconds in open-source Redis.
N1QL scales terribly.

Search seems much better … but without that aggregation … it leaves us back at square one :wink: So to speak.
Product enhancement request?

40+ seconds in N1QL

Did you have a secondary index defined to support the query?

Yes Secondary index.
However, it's an aggregation over 90% of 45 million JSON docs.

INDEX
CREATE INDEX typeagg ON bucket(type) WITH {"num_replica":1}

QUERY
select type, count(1) from ice_us where type is not missing group by type

For what you're trying to achieve with this query, there's a way with FTS. Here's a faceted full text search request over a default dynamic index, or over an index where you've indexed the field "type" …

curl -XPOST -H "Content-Type: application/json" \
http://Administrator:asdasd@127.0.0.1:8094/api/index/default/query -d '
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_tokens": {
      "field": "type",
      "size": 10
    }
  }
}
'

This would limit the number of facets generated to the top 10 (descending order of counts).
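The "size"/"other" mechanics can be sketched in Python. This is an illustrative model of how a facet result is assembled from per-term counts, not the server's actual code, and the term counts here are made up:

```python
from collections import Counter

def build_facet(term_counts, size):
    """Assemble an FTS-style facet result: keep the top `size` terms
    (descending by count) and roll everything else into `other`."""
    ranked = Counter(term_counts).most_common()  # sorted descending by count
    return {
        "total": sum(term_counts.values()),          # docs that had the field
        "other": sum(c for _, c in ranked[size:]),   # counts beyond the top N
        "terms": [{"term": t, "count": c} for t, c in ranked[:size]],
    }

# Hypothetical counts, purely for illustration
counts = {"phone": 500, "products": 900, "customer": 300, "email": 40, "cn": 5}
facet = build_facet(counts, size=3)
print(facet["terms"][0])  # highest-count term comes first
print(facet["other"])     # the two dropped terms: 40 + 5 = 45
```

The key takeaway is that terms past the cut-off aren't lost; their counts are summed into the `other` field of the response.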

FYI: if doing N1QL aggregation you should consider the EE version with index aggregation. https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/

Also check out https://blog.couchbase.com/n1ql-and-search-how-to-leverage-fts-index-in-n1ql-query/

We are using EE.
We were told by support that it's a lot of data, and it is what it is.

Thanks I’ll give it a go and repost results …

{"error":"rest_auth: preparePerms, err: index not found","request":{"facets":{"type_tokens":{"field":"type","size":10}},"query":{"match_all":{}}},"status":"fail"}

Hi @johnfak,

Could it be a copy-paste error? Did you replace the index name in the curl cmd with your actual FTS index name?

FTS aggregation ends up fetching columnar data from disk while serving faceted queries, and the intuition is that it may not keep latency low enough for ~40M docs. Awaiting your results …

But if not, have you also explored the in-memory covered secondary indexes?

Cheers!

I don't think it's a copy-and-paste error, because I also tried the JSON curl directly from the UI … and that failed too …
But it might be something with auth or the logic as written, so still digging.

My understanding from re-reading is that MOI is only at the entire-cluster level, not the individual-index level?

Dang … quick.

-bash-4.2$ time curl -u Administrator:********* -XPOST -H "Content-Type: application/json" http://localhost:8094/api/index/faksearch2/query -d '
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_tokens": {
      "field": "type",
      "size": 10
    }
  }
}'
real 0m0.430s
user 0m0.002s
sys 0m0.004s



{
  "total_hits": 2084816,
  "max_score": 1,
  "took": 422837925,
  "facets": {
    "type_tokens": {
      "field": "type",
      "total": 1296666,
      "missing": 788150,
      "other": 9,
      "terms": [
        { "term": "products", "count": 617019 },
        { "term": "phone", "count": 309668 },
        { "term": "customer", "count": 143777 },
        { "term": "addressbooklist", "count": 48893 },
        { "term": "email", "count": 45561 },
        { "term": "addressbooklookup", "count": 42036 },
        { "term": "addressbook", "count": 41891 },
        { "term": "custguid", "count": 20587 },
        { "term": "cn", "count": 15694 },
        { "term": "customerstatus", "count": 11531 }
      ]
    }
  }
}

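For what it's worth, the facet fields in that response do reconcile internally. A quick Python check using the numbers above (docs with a "type" field plus docs missing it account for every hit, and the top-10 term counts plus "other" account for every fielded doc):

```python
# Numbers copied from the facet response above
total_hits = 2084816
total, missing, other = 1296666, 788150, 9
term_counts = [617019, 309668, 143777, 48893, 45561,
               42036, 41891, 20587, 15694, 11531]

# total + missing should account for every matched document
assert total + missing == total_hits

# the top-10 terms + the "other" bucket should account for every fielded doc
assert sum(term_counts) + other == total

print("facet response is internally consistent")
```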
Only thing is the counts don't match. "products" above returns 617019

whereas

select * from (select type, count(1) from ice_us where type is not missing group by type order by 2 desc ) d limit 10
returns

[
  { "d": { "$1": 4056453, "type": "PRODUCTS" } },
  { "d": { "$1": 892754, "type": "addressbook" } },
  { "d": { "$1": 740859, "type": "addressbooklist" } },
  { "d": { "$1": 894095, "type": "addressbooklookup" } },
  { "d": { "$1": 3196, "type": "agentldap" } },
  { "d": { "$1": 15, "type": "avtaddrbkrel" } },
  { "d": { "$1": 2, "type": "avtcountry" } },
  { "d": { "$1": 4, "type": "avtcustaddrbk" } },
  { "d": { "$1": 1, "type": "avtcustnicknamedisp" } },
  { "d": { "$1": 24, "type": "avtcustomerstatus" } }
]
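One cosmetic difference worth noting: FTS's standard analyzer lowercases tokens before indexing, which is why N1QL's "PRODUCTS" appears as the facet term "products". Casing aside, the counts themselves still diverge badly. A quick comparison in Python (counts copied from the two result sets above) shows every facet count is well below its N1QL counterpart, which lines up with the incomplete index build noted further down the thread:

```python
# Counts copied from the FTS facet response and the N1QL result above
fts_facets = {"products": 617019, "addressbooklist": 48893,
              "addressbooklookup": 42036, "addressbook": 41891}
n1ql_counts = {"PRODUCTS": 4056453, "addressbook": 892754,
               "addressbooklist": 740859, "addressbooklookup": 894095}

# Normalize casing the way a lowercasing analyzer would
n1ql_lower = {t.lower(): c for t, c in n1ql_counts.items()}

for term, fts_count in fts_facets.items():
    print(term, fts_count, n1ql_lower[term], sep="\t")
# Every FTS count is far below its N1QL counterpart, consistent with
# the FTS index still being partially built.
```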

Yes you were correct - I missed that - thanks

Counts might not match because the index build isn't at 100% … it seems slow … I had assumed it had finished since it's just a single attribute (type) … so I dropped it and am now recreating.

That's true, it's a cluster-level feature!

Yeah, that would be hard to do on an existing production cluster … also, we're really just trying to satisfy 1-2 queries doing analysis/aggregation across the data … so we don't want all indexes fully occupied in memory.

I'll retry search once the index build completes … not sure why it's so slow compared to a normal N1QL index build.
CPU cores are only at around 30% across the board … but the top process is cbft.

01:09:04 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:09:09 PM all 28.85 0.00 3.80 0.25 0.00 0.15 0.00 0.00 0.00 66.94
01:09:09 PM 0 32.66 0.00 3.04 0.41 0.00 0.20 0.00 0.00 0.00 63.69
01:09:09 PM 1 27.46 0.00 3.28 0.00 0.00 0.20 0.00 0.00 0.00 69.06
01:09:09 PM 2 29.35 0.00 4.25 0.40 0.00 0.20 0.00 0.00 0.00 65.79
01:09:09 PM 3 26.01 0.00 4.44 0.00 0.00 0.20 0.00 0.00 0.00 69.35

top - 13:07:45 up 1 day, 18:12, 1 user, load average: 4.98, 3.82, 3.51
Tasks: 227 total, 1 running, 226 sleeping, 0 stopped, 0 zombie
%Cpu(s): 38.5 us, 5.9 sy, 0.0 ni, 55.1 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 16266540 total, 2174136 free, 8882112 used, 5210292 buff/cache
KiB Swap: 3903484 total, 3859700 free, 43784 used. 7029912 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20942 couchba+ 20 0 7040148 1.6g 46148 S 99.0 10.3 100:06.93 cbft
7370 couchba+ 20 0 6790280 3.9g 4636 S 30.2 25.0 452:54.22 memcached
3456 couchba+ 20 0 4271824 892216 3444 S 29.2 5.5 1709:36 beam.smp
20868 couchba+ 20 0 2920604 524168 11756 S 12.3 3.2 33:42.22 indexer
2250 root 20 0 555304 18244 7168 S 2.3 0.1 51:44.11 BESClient
20760 couchba+ 20 0 828764 162036 23668 S 1.3 1.0 1:22.55 cbq-engine
20889 couchba+ 20 0 1073592 403788 6380 S 1.3 2.5 3:52.11 projector
1871 couchba+ 20 0 3318164 55916 3160 S 0.7 0.3 377:01.21 beam.smp

@johnfak Glad the faceted full text search request did the trick for you. Regarding slow-building FTS indexes: how have you defined your FTS index? We could try fine-tuning it if you haven't already. Would you share your index definition here?

{
  "type": "fulltext-index",
  "name": "faksearch2",
  "uuid": "692cd72a277a620d",
  "sourceType": "couchbase",
  "sourceName": "ice_us",
  "sourceUUID": "021160cf87998bf9e4dc96303a90a13d",
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "indexPartitions": 6
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

Ok, so it looks like you're creating a default dynamic index, meaning you're indexing everything: all fields within all documents. Since the "type" field is the only one needed to execute the faceted query, let's change the index definition to index only the "type" field in the documents (where available). Here's how your index definition would look afterwards; I expect this to build a whole lot faster.

{
  "name": "faksearch2",
  "type": "fulltext-index",
  "uuid": "",
  "sourceType": "couchbase",
  "sourceName": "ice_us",
  "sourceUUID": "021160cf87998bf9e4dc96303a90a13d",
  "sourceParams": {},
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "numReplicas": 0,
    "indexPartitions": 6
  },
  "params": {
    "mapping": {
      "default_mapping": {
        "enabled": true,
        "dynamic": false,
        "properties": {
          "type": {
            "enabled": true,
            "dynamic": false,
            "fields": [
              {
                "name": "type",
                "type": "text",
                "store": false,
                "index": true,
                "include_term_vectors": true,
                "include_in_all": true,
                "docvalues": true
              }
            ]
          }
        }
      },
      "default_type": "_default",
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "store_dynamic": false,
      "index_dynamic": true
    },
    "store": {
      "indexType": "scorch"
    },
    "doc_config": {
      "mode": "type_field",
      "type_field": "type",
      "docid_prefix_delim": "",
      "docid_regexp": ""
    }
  }
}