Views deprecated with no real replacement?

I’ve used Couchbase in the past (8 years ago, time flies!) and loved it. We were able to use views to create useful and performant indexes on a massive dataset.

Now I have another big data problem and thought Couchbase could be a good fit again, but after catching up on the latest documentation and running some experiments, I must say that I’m disappointed.

I set up a test cluster with a dataset of 184M documents (~200GB) and added a simple index to it:

CREATE INDEX ix_num ON bucket (num DESC) WHERE irreversible = true

The index took several hours to build and took up 64.3GiB on disk, stored on a single node.

Creating the same index using the view:

function (doc, meta) {
  if (doc.irreversible) {
    emit(doc.num, null)
  }
}

Resulted in an index of 24.6GB distributed across 3 nodes (8.2GB per node). It took probably a third of the time to build, and queries run at about the same speed (and I can order in both directions!)

Then I tried something more complex:

CREATE INDEX ix_tx ON bucket(DISTINCT ARRAY tx.id FOR tx IN transactions END)

That index took forever to build. It didn’t even finish, because it filled 100% of the disk on the single node that processed it. I think it was at something like 800GB at 70% complete before falling over.

Now the same using a view:

function (doc, meta) {
  doc.transactions.forEach((tx) => {
    emit(tx.id)
  })
}

Took less than 2 hours to index and is using 84GB on disk across 3 nodes.

I would just shrug and think that the query indexes are for a different use case if it weren’t for the fact that you have deprecated views and just about every page of your documentation that mentions views also mentions that I shouldn’t be using them.

How is this supposed to replace distributed map reduce? And why should I choose Couchbase over PostgreSQL at this point?

@jn you can still use views if this is the only way to approach your use case. I think they are just discouraged overall compared with the newer query mechanisms.

Actually, I wonder if Couchbase Analytics is a better fit for your use case than N1QL queries. It will also distribute the index across analytics nodes and is well suited to handling hundreds of gigabytes with its indexes and ad-hoc queries.

I think trying analytics might be a good idea first, and if that also doesn’t work well for you using views is still possible.

@jn Are you using EE with GSI indexes (Plasma storage) or CE (ForestDB storage)?

If EE (Plasma), for your original index, there will be both main- and back-index entries for each document. If you are indexing all 184M documents (i.e. they all have irreversible == true), then there will be ~368,000,000 index entries consuming 64,300,000,000 bytes on disk (a disk GB is 10^9 bytes, not 2^30 bytes, so about 7% smaller than a memory GB). This implies each entry consumes about 175 bytes. If num is an 8-byte integer, then we are looking at about 167 bytes of overhead per entry in addition to the 8-byte key.

I am not part of the Index storage team so I don’t know if that is unexpectedly high or not. In the 64-bit world this is less than 21 words of overhead per entry, but that does seem like more than I’d guess were needed.
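The per-entry figures above can be re-derived with a few lines of arithmetic. This is only a sketch using the numbers quoted in this thread; it assumes every document is indexed, that each document produces one main-index and one back-index entry, and that num is an 8-byte integer.

```javascript
// Back-of-the-envelope check of the per-entry figures quoted in the thread.
// Assumptions: all 184M docs are indexed, 2 entries per doc (main + back
// index), index size read as decimal gigabytes, num stored as 8 bytes.
const docs = 184e6;
const entriesPerDoc = 2;
const indexBytes = 64.3e9;
const keyBytes = 8;

const entries = docs * entriesPerDoc;
const bytesPerEntry = indexBytes / entries;
const overheadBytes = bytesPerEntry - keyBytes;
const overheadWords = overheadBytes / 8; // 64-bit words

console.log(entries);                  // 368000000
console.log(bytesPerEntry.toFixed(1)); // 174.7
console.log(overheadBytes.toFixed(1)); // 166.7
console.log(overheadWords.toFixed(1)); // 20.8
```

If only a fraction of the documents are actually indexed, the per-entry cost rises proportionally, which is why the percentage of indexed documents matters.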

What percentage of your 184M docs have irreversible == true and thus actually get indexed?

However there are also the questions of key mutations and index storage compaction to consider. Fragmentation of the storage will generally be much higher if the index key changes, requiring some multiple of disk storage due to all the obsolete entries that have not been reclaimed yet.

How often does the key, num, get changed? We recommend keys be immutable. The name “num” sounds like it might be a non-immutable counter that changes frequently. If that is the case, documents will have to be reindexed on every key change, and Plasma disk storage is architected as an append-only linear stream so it does not update the old entry but rather marks it as deleted and writes the new entry to the end of the storage stream. This approach is optimized for documents being inserted and (usually less frequently) deleted but will have less desirable behavior if the keys of existing documents are frequently updated. Outside of periodic index compaction runs, an old page of storage cannot be reclaimed until all entries on it and all older pages are marked deleted (“tombstones”).

Periodic index compaction rewrites non-deleted entries of old pages at the end of the stream so the old pages can be reclaimed, but this is a batch type of operation, so if the keys are frequently updated there will always be a background level of new fragmentation generated because of this, and the average state will be that more (possibly a lot more) disk is consumed than just that needed for the live index entries. See

https://docs.couchbase.com/server/current/learn/services-and-indexes/indexes/storage-modes.html#standard-index-storage

https://docs.couchbase.com/server/current/manage/manage-settings/configure-compact-settings.html
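To make the fragmentation argument concrete, here is a toy model of an append-only store. It is not Plasma's actual implementation; the class and method names are invented for illustration, and compaction is simplified to rewriting live records into a fresh stream.

```javascript
// Toy append-only store: an update never overwrites in place, it marks the
// old record dead and appends a new one. Illustration only, not Plasma.
class AppendOnlyLog {
  constructor() {
    this.records = [];       // the "stream" of { key, docId, live }
    this.latest = new Map(); // docId -> position of its live record
  }

  // Insert a document, or re-index it under a changed key.
  upsert(docId, key) {
    const old = this.latest.get(docId);
    if (old !== undefined) this.records[old].live = false; // mark old entry dead
    this.latest.set(docId, this.records.length);
    this.records.push({ key, docId, live: true });
  }

  // Fraction of the stream occupied by dead records.
  fragmentation() {
    if (this.records.length === 0) return 0;
    const dead = this.records.filter((r) => !r.live).length;
    return dead / this.records.length;
  }

  // Simplified compaction: keep only the live records.
  compact() {
    const live = this.records.filter((r) => r.live);
    this.records = [];
    this.latest.clear();
    for (const r of live) this.upsert(r.docId, r.key);
  }
}

const log = new AppendOnlyLog();
for (let i = 0; i < 1000; i++) log.upsert(`doc${i}`, i); // inserts only
const afterInserts = log.fragmentation();                // 0

for (let round = 1; round <= 3; round++) {
  for (let i = 0; i < 1000; i++) log.upsert(`doc${i}`, i + round); // key changes
}
const afterUpdates = log.fragmentation();                // 0.75

log.compact();
const afterCompaction = log.fragmentation();             // 0
console.log(afterInserts, afterUpdates, afterCompaction);
```

In this model an insert-only workload produces no dead records, while three rounds of key changes leave three quarters of the stream dead until compaction runs.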

@Kevin.Cherkauer

I was using CE for this test, it reports the index as being “GSI Standard”.

The dataset is mostly immutable. Only a very small percentage of docs at any given time will have irreversible == false, and those are the only documents that can mutate. num is stable, but for irreversible == false documents, multiple documents can share the same number for a while.

For context, the full dataset is ~8TB. We currently have a custom-built indexing solution on top of BadgerDB that stores each entry pretty close to optimally; the transaction index example from above uses ~200GB on disk for the full dataset with our custom solution.

I didn’t expect to come close to that with Couchbase, but I also didn’t expect your indexes to have this much overhead. And even if we could get the index size down to something manageable, not being able to distribute the storage and computation of them like we could with views pretty much just makes it a much less efficient version of what we already have.


@daschl

Our use case needs performant and well-defined queries, not ad-hoc ones, so I doubt the analytics engine would work well, and building a new system on top of a deprecated database feature is a really hard sell.


I was doing some searching and found this: “Couchbase Views and Better Alternatives [Part 1 of 2]” (The Couchbase Blog).

Very sad to see Couchbase take what I think was its strongest feature and declare it dead, especially as the list in that article, presented as some sort of inherent limitations of views, could have been addressed, making them even better.

@jn

GSI Standard indexes in CE use the ForestDB storage engine, a B+tree-based indexing approach. In contrast EE uses the Plasma storage engine, a lock-free skip-list approach. There is not a lot of work done on optimizing ForestDB as Plasma scales better. CE is designed to be a free trial but not to support enterprise-scale deployments, just so you know that scalability of CE is limited.

not being able to distribute the storage and computation of them like we could with views

EE supports partitioned indexes, which allows the storage and computation to be spread over essentially as many nodes as you like. CE does not have this feature, so each index is local to one node.
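As a rough illustration of what hash partitioning on the document ID means, the sketch below routes each ID to one of a fixed number of partitions. The hash function, the ID format, and the function names are invented for this example; this is not the hash Couchbase uses.

```javascript
// Toy hash partitioning: map each document ID to one of N partitions.
// toyHash is a simple illustrative string hash, NOT Couchbase's.
function toyHash(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// The same ID always lands on the same partition.
function partitionFor(docId, numPartitions) {
  return toyHash(docId) % numPartitions;
}

const numPartitions = 3;
const counts = new Array(numPartitions).fill(0);
for (let i = 0; i < 9000; i++) {
  counts[partitionFor(`doc::${i}`, numPartitions)]++;
}
console.log(counts); // number of document IDs routed to each partition
```

Because the partition is a pure function of the document ID, each node can build and store its share of the index independently.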

@jn Another thing to note about views vs indexes: views update only every 5 seconds, whereas a new in-memory snapshot is made of each index every 200 milliseconds in ForestDB (CE), or every 10 milliseconds in Plasma or Memory-Optimized indexes (EE). Thus any query that does not specify a consistency time will on average see data that is 2.5 seconds behind current in the views case vs only 100 msec in the index case on CE, a factor of 25x difference, or 5 msec in the index case on EE, a factor of 500x.
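The staleness figures follow from halving each refresh interval, assuming reads arrive at uniformly random times between refreshes. A small sketch with the intervals quoted above (the object keys are just labels chosen for this example):

```javascript
// Average staleness of a read with no consistency requirement is half the
// refresh interval, assuming reads arrive uniformly between refreshes.
const refreshMs = { views: 5000, gsiForestDB: 200, gsiPlasma: 10 };

const avgStaleMs = Object.fromEntries(
  Object.entries(refreshMs).map(([name, ms]) => [name, ms / 2])
);

console.log(avgStaleMs); // { views: 2500, gsiForestDB: 100, gsiPlasma: 5 }
console.log(avgStaleMs.views / avgStaleMs.gsiForestDB); // 25
console.log(avgStaleMs.views / avgStaleMs.gsiPlasma);   // 500
```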

@jn

Non-ad hoc queries actually sound like they could be a good match for Analytics (i.e. more data warehouse style than OLTP).

Thank you @Kevin.Cherkauer

I’ve now set up a cluster using the Enterprise Edition and loaded a smaller dataset (87M docs, 118GiB). With that and a partitioned index, it performs on par with the views.

But not for the more complex cases. I’ve tried a bunch of different index and query combinations, but I’m unable to come up with something that can beat this view:

function (doc, meta) {
  doc.transactions.forEach((tx) => {
    tx.actions.forEach((action) => {
      emit([action.receiver, doc.num])
    })
  })
}

This gives me a fast lookup for any action by receiver pageable by document number. The index takes up 1.2GB on disk and took ~4 hours to build (using a single core on each node).
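To show what this view produces, the sketch below runs the same map logic outside Couchbase with a stand-in emit that collects rows. The sample documents and the latestFor helper are invented for illustration, and the filter is a linear scan, whereas a real view index would seek directly into the sorted key range.

```javascript
// Stand-in for the view engine: collect emitted keys instead of indexing them.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Same logic as the map function in the post.
function map(doc, meta) {
  doc.transactions.forEach((tx) => {
    tx.actions.forEach((action) => {
      emit([action.receiver, doc.num]);
    });
  });
}

// Hypothetical sample documents (values invented for illustration).
const docs = [
  { num: 1, transactions: [{ actions: [{ receiver: 'alice' }, { receiver: 'bob' }] }] },
  { num: 2, transactions: [{ actions: [{ receiver: 'alice' }] }] },
  { num: 3, transactions: [{ actions: [{ receiver: 'carol' }] }, { actions: [{ receiver: 'alice' }] }] },
];
docs.forEach((doc, i) => map(doc, { id: `doc${i}` }));

// Keep rows ordered by [receiver, num], as a view keeps its keys sorted.
rows.sort((a, b) =>
  a.key[0] < b.key[0] ? -1 : a.key[0] > b.key[0] ? 1 : a.key[1] - b.key[1]
);

// Newest `limit` document numbers for one receiver (linear scan for brevity).
function latestFor(receiver, limit) {
  return rows
    .filter((r) => r.key[0] === receiver)
    .map((r) => r.key[1])
    .reverse()
    .slice(0, limit);
}

console.log(latestFor('alice', 2)); // [ 3, 2 ]
```

Because all rows for one receiver sit next to each other in num order, reading a page of results touches only the rows returned, regardless of how many documents mention that receiver.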

The closest I’ve been able to come to that using query is:

CREATE INDEX ix_receiver ON bucket (
    DISTINCT ARRAY (
        DISTINCT ARRAY a.receiver FOR a IN tx.actions END
    ) FOR tx IN transactions END,
    num
)
PARTITION BY HASH(META().id)

This index took 45 min to build (using 8 cores on each node) and takes up 9.19GiB.

And querying using:

SELECT tx.*, d.num FROM bucket AS d
UNNEST d.transactions AS tx
UNNEST tx.actions AS a
WHERE a.receiver == 'alice'
ORDER BY d.num DESC
LIMIT 10

This works when the receiver only has a couple of actions associated with them, but some receivers are present in millions of documents, and querying for those takes several minutes.

Is it possible to model the view above as a query index that gives me lookup times independent of how many documents a receiver is in?

@jn I notice the index differs from the view in that the index has two DISTINCT keywords, each of which requires performing a sort operation to eliminate duplicates, whereas the view does not eliminate duplicates and thus does not need to do the sorts. This could be a cause of performance differential.
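The difference in the set of keys produced per document can be sketched in plain JavaScript. The sample document is invented for illustration, and this only models which keys result, not how the indexer computes them.

```javascript
// Hypothetical document in which one receiver appears several times.
const doc = {
  num: 7,
  transactions: [
    { actions: [{ receiver: 'alice' }, { receiver: 'alice' }] },
    { actions: [{ receiver: 'bob' }, { receiver: 'alice' }] },
  ],
};

// ALL-style (and what the view emits): one key per array element.
const allKeys = doc.transactions.flatMap((tx) =>
  tx.actions.map((a) => a.receiver)
);

// DISTINCT-style: duplicates within the document are removed.
const distinctKeys = [...new Set(allKeys)];

console.log(allKeys);      // [ 'alice', 'alice', 'bob', 'alice' ]
console.log(distinctKeys); // [ 'alice', 'bob' ]
```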

@Kevin.Cherkauer I’ve tried many different variations of that index, including ALL ARRAY ALL ARRAY. That one just makes the index slightly bigger, with no performance difference for the worst case.

@jn First up, thanks a lot for reaching out and for your feedback on views. As @daschl mentioned, you can continue to use views the old way but the more advanced use cases with scopes & collections are not going to be available with views. And most of the use cases with views are well served by Query & Indexes (N1QL + GSI) as you may have already experienced by now. For the complex view on arrays use case you mentioned, I request our N1QL expert @vsr1 to chime in & help with the best possible Index & Query.

As already noted, partitioned indexes are a more apples-to-apples comparison with views. And Standard GSI indexes in EE (with Plasma) perform and scale quite well. We have some of the largest workloads running on them in the field. In addition, GSI indexes have gotten even better with Couchbase Server EE 7.0.x, with reduced resource consumption and increased performance and scale when used in conjunction with a better data model exploiting scopes & collections. You can look at all the latest performance numbers at ShowFast.

We have it on the roadmap to bring JavaScript-based custom map & reduce capabilities into N1QL-GSI, so we clearly have a path forward to ensure Couchbase offers better alternatives to views in the future. Please see MB-33228, MB-48270. Thanks again!

@jn, for clarity, the Views Reference (Couchbase Docs) specifically says:

Note: Views are deprecated in Couchbase Server 7.0+. Views support in Couchbase Server will be removed in a future release only when the core functionality of the View engine is covered by other services.

The key thing in this statement, as @jeelan.poola also points out, is that Views will continue to exist until the core functionality is covered by other services.

I have been doing some experiments with map/reduce in Eventing (it’s complex), but I would be more than happy to see if I can solve your use case. My goal here, besides trying to help you, is to provide insight into parallel GPU-like processing that can be integrated into the final replacement for Views, which might be a combination of techniques from both Eventing (think DCP) and GSI.

If you’re willing to work with me by sharing your View definitions, the test data set, the velocity of data change, the percentage of data that is immutable, and typical access patterns, just contact me directly at couchbase.com or DM me here and we can schedule some time to kick things off.

Best

Jon Strabala
Principal Product Manager - Server

There are two parts here

  1. Map/reduce views. These can be directly used by SDKs/UI.
  2. N1QL has 3 types of indexes Couchbase 7 views
    Indexes based on GSI (default, USING GSI)
    Indexes based on FTS (USING FTS)
    Indexes based on map/reduce views (USING VIEW). In 7.0, N1QL queries are not able to use this index type; the USING VIEW functionality has been removed.