Partial View Index on META().id Is Not Translated into JavaScript Mapper Correctly

When creating a partial index using views, the resulting JavaScript mapper code simply does not work:

CREATE INDEX meta_id ON `beer-sample`(META().id) WHERE META().id <> "" USING VIEW;

Query results:

{
  "results": []
}

The resulting view meta_id within the design document _design/ddl_meta_id is generated as follows:

function (doc, meta) {
  if (meta.type != "json") return;

  var stringToUtf8Bytes = function (str) {
    var utf8 = unescape(encodeURIComponent(str));
    var bytes = [];
    for (var i = 0; i < utf8.length; ++i) {
        bytes.push(utf8.charCodeAt(i));
    }
    return bytes;
  };

  var indexFormattedValue = function (val) {
    if (val === null) {
      return [64];
    } else if (typeof val == "boolean") {
      return [96, val];
    } else if (typeof val == "number") {
      return [128, val];
    } else if (typeof val == "string") {
      return [160, stringToUtf8Bytes(val)];
    } else if (typeof val == "object") {
      if (val instanceof Array) {
        return [192, val];
      } else {
        var innerKeys = [];
        for (var k in val) {
          innerKeys.push(k);
        }
        innerKeys.sort()
        var innerVals = [];
        for (var i in innerKeys) {
          if (typeof val[innerKeys[i]] == "object" && 
            (val[innerKeys[i]] === null || Object.keys(val[innerKeys[i]]).length === 0)) {
            if (val[innerKeys[i]] === null) {
                innerVals.push([64])
            } else if (Object.keys(val[innerKeys[i]]).length === 0) {
              innerVals.push([224, [[], []]]);
            }
          } else {
            innerVals.push(indexFormattedValue(val[innerKeys[i]]));
          }
        }
        return [224, [innerKeys, innerVals]];
      }
    } else {
        return undefined;
    }
  };

  var key1 = indexFormattedValue(doc.meta.id);
  var key = [key1];
  var pos = key.indexOf(undefined);
  if (pos == 0) {
    return;
  } else if (pos > 0) {
    key.splice(pos)
  }

  if (! (doc.meta.id == "")) {
     emit(key, null);
  }

}
// salt: 1031304417

Note the lines

var key1 = indexFormattedValue(doc.meta.id);

and

if (! (doc.meta.id == "")) {

Here meta.id must be used, not doc.meta.id, which is simply incorrect: the mapper receives the metadata as its separate meta argument, not as a field of the document.
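
For reference, a minimal sketch of what the corrected lines would presumably look like, reading the id from the meta argument the mapper already receives:

var key1 = indexFormattedValue(meta.id);

and

if (! (meta.id == "")) {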

I tested this on Couchbase versions 4.5.0, 4.6.2, and 5.0.0 Beta 2, and did not find a corresponding issue in the bug tracker.

This is a huge issue for me, as my Couchbase installation contains up to 7.5 billion documents, and building a full view index (USING VIEW) requires a prohibitive amount of memory. A GSI-powered index would require massive network traffic between the nodes, since no data locality is provided in that case, and the index node would become a bottleneck because a GSI index is not distributed. None of that is desirable, as I try to keep latencies as low as possible (currently 250 microseconds).

Is there a workaround, or am I misunderstanding something? Simply editing the automatically generated view does not help, as the index engine evidently verifies the content of the mapper and refuses to use a modified one.

I’m from the query team. We are aware of this problem, and are trying to figure out what to do about it. Hold on.

We are encouraging customers to move to GSI in 4.x and above.
In 4.x you can create duplicate equivalent indexes, and in 5.0 you can simply set the replication factor for each index.
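
For example, a sketch of both approaches (index, bucket, and node names here are illustrative, not from this thread):

/* 4.x: two equivalent indexes with identical definitions, placed on different nodes */
CREATE INDEX idx_id_1 ON `my-bucket`(META().id) USING GSI WITH {"nodes": ["node1:8091"]};
CREATE INDEX idx_id_2 ON `my-bucket`(META().id) USING GSI WITH {"nodes": ["node2:8091"]};

/* 5.0: a single index definition with a replica */
CREATE INDEX idx_id ON `my-bucket`(META().id) USING GSI WITH {"num_replica": 1};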

/cc @venkat @siri

Hi - as views need to scatter/gather from all nodes, query latencies may not be very low. If you can create several smaller indexes using CREATE INDEX … WHERE, i.e., manually partition using the WHERE clause on CREATE INDEX, GSI may be an option for you.
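
A sketch of that manual partitioning, assuming the document keys carry recognizable prefixes (index names and predicates are illustrative):

/* Each index covers a disjoint slice of the key space; a query whose WHERE
   clause matches one of these predicates can be served by the smaller index. */
CREATE INDEX idx_part_a ON `my-bucket`(META().id) WHERE META().id LIKE "a%" USING GSI;
CREATE INDEX idx_part_b ON `my-bucket`(META().id) WHERE META().id LIKE "b%" USING GSI;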

Note that GSI runs a component (“projector”) alongside the KV nodes which trims each document down to only the fields that are actually required to build the index, so the entire document is generally not sent from the KV to the GSI nodes.

It’s hard to predict performance without more detailed data, but it may be worth comparing GSI and views to see which performs better in your use case, especially if using the 5.0 beta EE standard GSI storage mode (“plasma”) is an option.

So far I have tested GSI indexes on a test installation with 1.4 billion documents, and the single node holding the index was loaded at up to 314 MB/s (2.5 Gbit/s); that was too much, and the build process still had not finished after 24 hours. For a production cluster with 7.5 billion documents, using GSI indexes seems practically impossible.

My use case does not require low latency when querying a view, since I intend to use views for analysis purposes via N1QL, and that should not affect latency when querying the bucket itself. Views are distributed by design; they work well on my installation in other, similar use cases and do not affect the performance of KV get/set operations.

All of the above combined is why I intended to use views instead of GSI, and it is unfortunate that a bug is preventing me from using the power of N1QL. I cannot include the document id in the document itself, as that would impose a high memory-consumption overhead.

@johan_larson, is there a ticket for this bug in JIRA? If you can point to the places in the code where a fix should be applied, I could implement it, since I use a patched build of Couchbase 4.5.0 anyway. (The current patch enables operation in an IPv6-only environment, if you are interested.)

@keshav_m, what practice should I use for indexes on a bucket with 7.5 billion documents and 30 or more nodes?

Bug MB-19372 includes the problem you are describing, among other things.

I would discourage you from trying to patch the view index code, because we plan to deprecate view indexes in upcoming releases anyway, meaning the effort would ultimately be for naught. Instead, tweak your data model to include the document id inside each document itself, and use that field in your queries rather than META().id. That should work fine.
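
A minimal sketch of that suggestion, assuming a hypothetical field docId into which each document stores a copy of its own key:

/* Index the stored copy of the key instead of META().id; for an ordinary
   field the generated view mapper reads doc.docId, which works correctly. */
CREATE INDEX meta_id ON `beer-sample`(docId) WHERE docId <> "" USING VIEW;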

@rogozhnikov.andrey We would highly appreciate it if you could take Couchbase Server 5.0 with GSI for a spin with similar workloads and let us know about the performance.

Since you are mentioning 7.5B documents, it is best to partition the indexes (GSI) based on your query pattern; essentially, first formulate the query, and then look at indexes to make it faster.

We can help further, but we need to understand many other variables, such as your node size, data update patterns, whether or not you need covering indexes, document size, key size, etc.

Johan, thanks for clarifying the state of view indexes as a feature. GSI should certainly gain the ability to partition indexes automatically, since it will eventually become the only index engine. I hope the views themselves will remain.

Unfortunately, using 5.0 would require updating our IPv6 patch to account for the changes made since 4.5.0, so it is not an easy task. As far as I can tell from the tickets in JIRA, there is an ongoing effort to support IPv6 out of the box, isn't there? In what version is it planned to be implemented?

I tried to gradually narrow down the subsets of keys matching the WHERE clause when creating the index: 108 million, 35 million, tens of thousands.

CREATE INDEX my_index ON my_bucket(META().id) WHERE META().id LIKE "prefix%" USING GSI;

Each time, initial progress was rapid but stalled at 73%; further progress was too slow to wait for it to finish. The speed does not appear to depend on the number of keys in the subset. The maximum progress was 96% after 48 hours on a subset of 84 million keys. UPD: an index for 8958 keys required 3 hours to build. Obviously, something is not working right.

The cluster has 12 nodes, each with 125 GB of memory; documents are about 100-200 bytes, and records are updated constantly, with maybe 10% updated per 24 hours. Keys are about 10-40 bytes, and the indexes for my tasks initially need to contain only META().id.