I’ve used Couchbase in the past (8 years ago, time flies!) and loved it. We were able to use views to create useful and performant indexes on a massive dataset.
Now I have another big data problem and thought Couchbase could be a good fit again, but after catching up on the latest documentation and running some experiments, I must say that I’m disappointed.
I set up a test cluster with a dataset of 184M documents (~200GB) and added a simple index to it:
CREATE INDEX ix_num ON bucket (num DESC) WHERE irreversible = true
The index took several hours to build and took up 64.3GiB on disk, stored on a single node.
Creating the same index using the view:
function (doc, meta) {
  if (doc.irreversible) {
    emit(doc.num, null)
  }
}
This resulted in an index of 24.6GB distributed across 3 nodes (8.2GB per node). It took probably a third of the time to build, and queries run at about the same speed (and I can order it in both directions!).
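To put the two sizes on the same scale, here is a quick back-of-the-envelope sketch. It assumes the "GiB" figure is binary (2^30 bytes) and the "GB" figure is decimal (10^9 bytes), which may not match how each tool actually reports sizes; the variable names are just mine. Under that assumption the query index comes out roughly 2.8 times larger than the view index, and it all sits on one node.

```javascript
// Sizes as reported in this post.
// Assumption: "GiB" means 2^30 bytes and "GB" means 10^9 bytes.
const queryIndexSizeGiB = 64.3; // query index, stored on a single node
const viewIndexSizeGB = 24.6;   // view index, total across the cluster
const viewNodes = 3;

// Convert the query index size to decimal GB so both numbers share a unit.
const queryIndexSizeGB = (queryIndexSizeGiB * 2 ** 30) / 1e9;

// The view index is spread evenly over the nodes in my cluster.
const viewPerNodeGB = viewIndexSizeGB / viewNodes;

// How much larger the query index is than the whole view index.
const sizeRatio = queryIndexSizeGB / viewIndexSizeGB;

console.log(`Query index: ${queryIndexSizeGB.toFixed(1)} GB on one node`);
console.log(`View index: ${viewPerNodeGB.toFixed(1)} GB per node`);
console.log(`Query index / view index: ${sizeRatio.toFixed(1)}x`);
```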
Then I tried something more complex:
CREATE INDEX ix_tx ON bucket(DISTINCT ARRAY tx.id FOR tx IN transactions END)
That index took forever to build. It didn’t even finish, because it filled 100% of the disk on the single node that processed it. I think it was at something like 800GB at 70% complete before falling over.
Now the same using a view:
function (doc, meta) {
  doc.transactions.forEach((tx) => {
    emit(tx.id)
  })
}
That took less than 2 hours to index and uses 84GB on disk across 3 nodes.
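For anyone who wants to see what that view emits without a cluster, here is a minimal sketch that runs the same map logic locally against a stub emit. The `makeMap` and `runMap` helpers, the document ids and the sample documents are all hypothetical and not part of Couchbase. The plain string sort only approximates the view engine's collation, and recording a missing value as null is my assumption.

```javascript
// Wraps the view's map function so that emit() can be supplied by the caller.
// (In Couchbase, emit() is provided by the view engine; this wrapper is hypothetical.)
function makeMap(emit) {
  return function (doc, meta) {
    doc.transactions.forEach((tx) => {
      emit(tx.id);
    });
  };
}

// Runs the map function over sample documents and collects the emitted rows.
function runMap(docs) {
  const rows = [];
  let currentId = null;
  // Stub emit(): records key, value and the id of the document being mapped.
  // Assumption: a missing value is stored as null.
  const emit = (key, value) =>
    rows.push({ key, value: value === undefined ? null : value, id: currentId });
  const map = makeMap(emit);
  for (const { id, doc } of docs) {
    currentId = id;
    map(doc, { id });
  }
  // Views return rows ordered by key; this string comparison is only an
  // approximation of the real view collation.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Hypothetical sample data shaped like my documents.
const sampleDocs = [
  { id: "doc::1", doc: { num: 1, transactions: [{ id: "tx-b" }, { id: "tx-a" }] } },
  { id: "doc::2", doc: { num: 2, transactions: [{ id: "tx-c" }] } },
];

const rows = runMap(sampleDocs);
console.log(rows);
```

Each transaction becomes one row, so a document with N transactions contributes N entries to the index.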
I would just shrug and think that the query indexes are for a different use case if it weren’t for the fact that you have deprecated views and just about every page of your documentation that mentions views also mentions that I shouldn’t be using them.
How is this supposed to replace distributed map reduce? And why should I choose Couchbase over PostgreSQL at this point?