How does Couchbase perform with aggregations?

jonathan.sandusky · April 18, 2017, 3:08pm

I’m currently using elasticsearch and we index millions of records every weekend and perform aggregations on them.

Usually in our aggregation we would filter down on attribute 1 and attribute 2 and within a time range. Then we run an aggregation that puts the filtered items into buckets and aggregating those buckets by another attribute to put those into buckets. After all that we sum up the sales in those buckets in buckets. At this point elasticsearch is only returning the aggregations and have lost it’s other details in the document like name, item_code. The only way to get them back is by running another aggregation on the bucket in buckets that returns it’s details. After all this I have to parse through the data and match up the aggregations with the details and present them to the user.

Elasticsearch also has a query limit. It only returns 10000 items and doesn’t offer pagination on aggregations.

How does couchbase deal with all of this. I know that my use case sounds like the ravings of a lunatic so I apologize in advance.

geraldss · April 18, 2017, 4:44pm

Copying @prasad and @keshav_m to respond.

prasad · April 18, 2017, 5:32pm

Hi @jonathan.sandusky, Couchbase supports N1QL, which is a very sophisticated SQL for JSON documents. It is pretty comprehensive in supporting various SQL constructs, and also has extensions to deal with nested & array data (of JSON). For your problem, N1QL supports SELECT with WHERE-clause/LIMIT/OFFSET/GROUP BY/ORDER BY/Subqueries/JOINs and various aggregate functions (such as SUM, COUNT, AVG etc). Couchbase/N1QL also supports Secondary indexes and many Query/N1QL optimizations (including operator pushdowns) to efficiently use the indexes for performance.

Pls post some sample documents, and a pseudo/SQL/results of exactly what you want to achieve. We can help with equivalent N1QL query. You can find more details at https://developer.couchbase.com/documentation/server/4.6/n1ql/n1ql-language-reference/index.html
-Prasad

jonathan.sandusky · April 18, 2017, 6:58pm

Here is an elasticsearch query I’m using to aggregate data. The data is denormalized and is one “table” or document.

A document just has a dept_id, dept_name, sub_dept_id, sub_dept_name, item_id, item_name, location, week_date

Each dept will have about 104 million records max. Currently we have 3 depts. We insert about 3 million documents weekly.

GET weekly_sales/_search
{
  "size": 1, 
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "week_date": {
            "format": "yyyy-MM-dd", 
            "gte": "2016-01-01",
            "lte": "2017-02-21"
          }
          }
        },
        {
          "term": {
            "department_id": "4"
          }
        },
        {
          "term": {
            "sub_dept_id": "20"
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_location": {
      "terms": {
        "size": 10000,
        "field": "location"
      },
      "aggs": {
        "group_by_item_id": {
            "terms": {
              "size": 10000,
              "field": "item_id"
          },
      "aggs": {
        "total_sales" :{
          "sum": {
            "field": "total_sales"
          }
        },
        "agg_source": {
          "top_hits": {
            "size": 1, 
            "_source": {
              "includes": ["department_id", "dept_name", "sub_dept_id", "sub_dept_name","item_id", "item_name", "location"]
            }
          }
        }
      }
        }
      }
    }
  }
}

My end result is to have something similar to(this is how elasticsearch returns data, couch might return something better or cleaner i hope):

    {
      "grouped_by_location": [
        {    
          {
            "location": 1,
            "grouped_by_item_id: [
              "item_id": 333
              {
                "dept_id":"1",
                "dept_name":"dept1",
                "sub_dept_id":"2",
                "sub_dept_name":"subDeptName",            
                "item_name":"item333",
                "total_sales_sum": 48584943
              }    
            ]
          }
        }
      ]
    }

keshav_m · April 20, 2017, 12:59am

If you have a document like this:

select * from c;

{
      "department_id": "4",
      "dept_id": "5",
      "dept_name": "lego",
      "item_id": 3734,
      "item_name": "EV3",
      "location": 23,
      "sub_dept_id": "501",
      "sub_dept_name": "robots",
      "total_sales": 392.49,
      "week_date": {
        "format": "yyyy-MM-dd",
        "gte": "2016-01-01",
        "lte": "2017-02-21"
      }
    }

You can issue queries like below:

SELECT location, item_id, SUM(total_sales) AS total_sales_sum
FROM c
GROUP BY location, item_id;

you can manipulate doc document set to group, aggregate and order.
N1QL has additional aggregate functions to create/manipulate arrays as well.

Topic		Replies	Views
Indexing and Search on Huge Bucket Analytics n1ql , view	4	2471	May 4, 2018
Should i use ElasticSearch with couchbase 5 SQL++ query , n1ql	1	978	May 7, 2018
How to optimize query performance which aggregate millions of rows SQL++ data_modelling , n1ql , index	3	2520	April 24, 2017
N1ql very low performance?! SQL++ query , n1ql , dot-net	2	1807	March 13, 2016
Optimized my Analytics N1QL query Analytics query , n1ql , analytics	9	1496	September 10, 2020

How does Couchbase perform with aggregations?

Related topics