Best design for performance?

I'm trying to learn a bit about the various noSQL stuff out there and started with couchbase. We have a SQL database with a couple of fact tables with a few hundred million rows in each one... doing any sort of analysis of data in the whole table is... arduous. So I figured that could be a good set of data to play with.

The SQL tables look like this:

-factMetricFloats
-300 million rows
-FK links to host, metric and configuration dimensions
-sample date (generally 15 second intervals)
-metric value as a float (double in .NET)
-factMetricInts ... same as above table, but with integer values instead of floats
-dimension tables for metric and host... just have key and name as a string
-configuration dimension has a slew of string attributes in it

I've tried to create the documents a few different ways in couchbase. The most recent way I tried, I created a single data bucket and used an incrementing key for the internal key (using the .NET CLI Increment method). All the documents have a "type" key/value that indicates if it is a host, a metric, a configuration, or a sample entry. I used 0,1,2,3 instead of strings. The sample document looks something like this:

{
   "levelkey": 2162129,
   "metrickey": 3102,
   "hostkey": 38,
   "sampledt": "2013-03-02T14:11:35",
   "value": 1,
   "type": "0"
}

I created a view to give me the sum, count, average, and eightieth percentile of the value grouped by the three keys (ignoring date).

I have two nodes, each with 16GB RAM and 4 vCPU. Disk backing is an FC disk array, but I see very little in the way of IO on the two nodes when they are under load. I loaded in about 27 million items to play with.

My view (at the bottom of this post), takes about 2 minutes to return the data each time I run it. CPU on the hosts seems to be around 40%. Data seems excessive given the small amount of data... looks like it takes about 11GB... the entire SQL database is about 25GB.

I assume I'm doing something wrong and I can't find any resources that offer examples or advice for designing datamodels for couchbase that don't power a blog site (if I were doing that, there are plenty of examples). If I'm not doing anything wrong... am I trying to use the "wrong tool for the job"?

The view...
map:

function (doc, meta) {
  if (doc.type == 0) {
    emit([doc.levelkey,doc.metrickey,doc.hostkey],doc.value);
  }
}

reduce:

function (key, values, rereduce)
{
  var result = {total: 0, count: 0, eightieth: 0, average: 0};
  if (rereduce)
  {
    var vals = [];
    for(i=0; i< values.length; i++)
    {
      result.total = result.total + values[i].total;
      result.count = result.count + values[i].count;
      vals[i] = values[i].eightieth;
    }
    vals[values.length] = result.eightieth;
    result.eightieth = sum(vals) / vals.length;
  }
  else
  {
    values.sort();
    result.total = sum(values);
    result.count = values.length;
    result.eightieth = values[Math.floor(values.length * 0.8)];
  }
 
  result.average = result.total / result.count;
  return(result);
}

1 Answer

« Back to question.

Hi,
that it takes two minutes for the results should be the case. Can you supply the code/request you do when querying? I would suspect that you use stale=false as a parameter.
A custom reduce will slow down the MapReduce View creation, but shouldn't slow down the queries. Nonetheless can you try the built-in _stats function [1] instead? I know it doesn't have the 80th percentile, but at least we can see if this causes additional slowness.
(1) http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writi...

Cheers,
Volker

I'm pretty sure it is the custom function... I have that sort on the values array and sorts are the most expensive operation. If I limit my map function to include only a specific metrickey the query takes a second (after the index is built).
The code isn't anything fancy.

static void _queryMoreData()
        {
            CouchbaseClient db = new CouchbaseClient();
 
            foreach (var row in db.GetView("test", "all_metrics").GroupAt(3))
            {
                object[] keys = (object[])row.Info["key"];
                if (keys[1] == null)
                {
                    continue;
 
                }
                if (keys[1].ToString() == "2203")
                {
                    Dictionary<string, object> vals = (Dictionary<string, object>)row.Info["value"];
                    try
                    {
                        Console.WriteLine(string.Join("\t", keys) + "\t" + vals["total"].ToString() + "\t" + vals["count"].ToString() + "\t" + vals["average"].ToString() + "\t" + vals["eightieth"].ToString());
                    }
                    catch { }
                }
            }
 
 
        }

My reply wasn't completely accurate. Indeed there is some part of the reduce, that needs to be run on query time.

So yes, it's your custom reduce together with the the group_level of 3 is probably the major slow down.
To speed up things you could queries on smaller ranges, currently you query the full View if I read that correctly.

Another things you might want to try is using the HTTP API directly, to make sure the SDK isn't the bottle neck.

Cheers,
Volker