How does Couchbase perform with aggregations?

I’m currently using elasticsearch and we index millions of records every weekend and perform aggregations on them.

Usually in our aggregation we would filter down on attribute 1 and attribute 2 and within a time range. Then we run an aggregation that puts the filtered items into buckets and aggregating those buckets by another attribute to put those into buckets. After all that we sum up the sales in those buckets in buckets. At this point elasticsearch is only returning the aggregations and have lost it’s other details in the document like name, item_code. The only way to get them back is by running another aggregation on the bucket in buckets that returns it’s details. After all this I have to parse through the data and match up the aggregations with the details and present them to the user.

Elasticsearch also has a query limit. It only returns 10000 items and doesn’t offer pagination on aggregations.

How does couchbase deal with all of this. I know that my use case sounds like the ravings of a lunatic so I apologize in advance.

Copying @prasad and @keshav_m to respond.

1 Like

Hi @jonathan.sandusky, Couchbase supports N1QL, which is a very sophisticated SQL for JSON documents. It is pretty comprehensive in supporting various SQL constructs, and also has extensions to deal with nested & array data (of JSON). For your problem, N1QL supports SELECT with WHERE-clause/LIMIT/OFFSET/GROUP BY/ORDER BY/Subqueries/JOINs and various aggregate functions (such as SUM, COUNT, AVG etc). Couchbase/N1QL also supports Secondary indexes and many Query/N1QL optimizations (including operator pushdowns) to efficiently use the indexes for performance.

Pls post some sample documents, and a pseudo/SQL/results of exactly what you want to achieve. We can help with equivalent N1QL query. You can find more details at https://developer.couchbase.com/documentation/server/4.6/n1ql/n1ql-language-reference/index.html
-Prasad

Here is an elasticsearch query I’m using to aggregate data. The data is denormalized and is one “table” or document.

A document just has a dept_id, dept_name, sub_dept_id, sub_dept_name, item_id, item_name, location, week_date

Each dept will have about 104 million records max. Currently we have 3 depts. We insert about 3 million documents weekly.

GET weekly_sales/_search
{
  "size": 1, 
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "week_date": {
            "format": "yyyy-MM-dd", 
            "gte": "2016-01-01",
            "lte": "2017-02-21"
          }
          }
        },
        {
          "term": {
            "department_id": "4"
          }
        },
        {
          "term": {
            "sub_dept_id": "20"
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_location": {
      "terms": {
        "size": 10000,
        "field": "location"
      },
      "aggs": {
        "group_by_item_id": {
            "terms": {
              "size": 10000,
              "field": "item_id"
          },
      "aggs": {
        "total_sales" :{
          "sum": {
            "field": "total_sales"
          }
        },
        "agg_source": {
          "top_hits": {
            "size": 1, 
            "_source": {
              "includes": ["department_id", "dept_name", "sub_dept_id", "sub_dept_name","item_id", "item_name", "location"]
            }
          }
        }
      }
        }
      }
    }
  }
}

My end result is to have something similar to(this is how elasticsearch returns data, couch might return something better or cleaner i hope):

    {
      "grouped_by_location": [
        {    
          {
            "location": 1,
            "grouped_by_item_id: [
              "item_id": 333
              {
                "dept_id":"1",
                "dept_name":"dept1",
                "sub_dept_id":"2",
                "sub_dept_name":"subDeptName",            
                "item_name":"item333",
                "total_sales_sum": 48584943
              }    
            ]
          }
        }
      ]
    }

If you have a document like this:

select * from c;

{
      "department_id": "4",
      "dept_id": "5",
      "dept_name": "lego",
      "item_id": 3734,
      "item_name": "EV3",
      "location": 23,
      "sub_dept_id": "501",
      "sub_dept_name": "robots",
      "total_sales": 392.49,
      "week_date": {
        "format": "yyyy-MM-dd",
        "gte": "2016-01-01",
        "lte": "2017-02-21"
      }
    }

You can issue queries like below:

SELECT location, item_id, SUM(total_sales) AS total_sales_sum
FROM c
GROUP BY location, item_id;

you can manipulate doc document set to group, aggregate and order.
N1QL has additional aggregate functions to create/manipulate arrays as well.