Limit on number of documents

Hi,

I left a comment on another similar post, but it was never answered, so I decided to open a new question.

It seems that Couchbase has a very frustrating limitation on the number of documents that can be stored in a node/cluster, depending on the total memory.

My calculations: taking an EC2 Large instance as a base and using the calculation provided in the Couchbase 2.0 manual, I get the following:

Preliminary data:
1. An EC2 Large instance has 7.5 GB of memory, and only about 5.5 GB of it can be dedicated to a bucket. So we use 5.5 GB.
2. Suppose we have no replicas (no_of_copies = 1 + number_of_replicas (0) = 1).
3. Document key length is 20 bytes (ID_size = 20).
4. Metadata is 120 bytes per document (this is the space Couchbase needs to keep per-document metadata; all the keys and their metadata need to live in memory at all times and take no more than 50% of the memory dedicated to a bucket).
5. Suppose the intended number of documents is 100M.
6. For simplicity, we do not even take the total on-disk size of the documents into account.

Memory needed = (documents_num) * (metadata_per_document + ID_size) * (no_of_copies)

100,000,000 * (120 + 20) * 1 = 14,000,000,000 bytes ≈ 13 GB
13 GB * 2 (keys + metadata must stay under 50% of memory) = 26 GB of bucket memory needed to operate a 100M-document bucket
26 GB / 5.5 GB per node = 4.73, i.e. 5 EC2 Large instances
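
To make the arithmetic easy to replay, here is a small Python sketch of the same calculation. The constants (5.5 GB of bucket RAM per node, 120-byte metadata, 20-byte keys, the 50% rule) are just my assumptions from the list above, not official figures:

```python
# Rough node-count estimate, using the formula quoted from the CB2 manual.
# All constants are assumptions from the list above, not official figures.
import math

def nodes_needed(documents_num,
                 metadata_per_document=120,   # bytes of metadata per document
                 id_size=20,                  # bytes per document key
                 number_of_replicas=0,
                 bucket_ram_per_node_gb=5.5,  # assumed usable bucket quota on an EC2 Large
                 metadata_ram_fraction=0.5):  # keys + metadata <= 50% of bucket RAM
    copies = 1 + number_of_replicas
    metadata_bytes = documents_num * (metadata_per_document + id_size) * copies
    bucket_ram_gb = metadata_bytes / metadata_ram_fraction / (1024 ** 3)
    return bucket_ram_gb, math.ceil(bucket_ram_gb / bucket_ram_per_node_gb)

ram_gb, nodes = nodes_needed(100_000_000)
print("%.1f GB of bucket RAM -> %d EC2 Large instances" % (ram_gb, nodes))
# ~26.1 GB -> 5 instances, the same result as above.
```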

So that means that to store 100M documents on EC2 Large instances we need at least 5 of them. This number of documents is nothing for a reasonably serious project, yet the cost of the servers (not even counting Couchbase support license fees) will be enormous.

And if we want to have a replica, the number doubles.

These calculations are very rough. Using the proper, more detailed formula will not change the result significantly, and may even make it worse.

I really hope that my calculations are wrong, and I look forward to being disproved, because I like the product very much in a number of respects and would really like to use it in our projects.

1 Answer


The metadata overhead per item is actually down to about 60 bytes. In previous versions it was higher, but the latest manual should have the right numbers (http://www.couchbase.com/docs/couchbase-devguide-2.0/more-on-metadata.html). I would also say that keeping the metadata below 50% of RAM is not a hard requirement, but rather a guideline to ensure you still have enough space to get the benefit of caching the actual document values.
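
Just to illustrate, re-running the rough formula from your question with ~60 bytes of metadata (and keeping your assumed 20-byte keys, 5.5 GB of bucket RAM per node and the 50% guideline) roughly halves the requirement:

```python
# Same rough formula as in the question, with ~60 bytes of metadata per item.
# The 20-byte key, 5.5 GB per-node quota and 50% guideline are the question's assumptions.
import math

documents = 100_000_000
per_item_bytes = 60 + 20                               # metadata + key
ram_gb = documents * per_item_bytes / 0.5 / (1024 ** 3)
print("%.1f GB -> %d nodes" % (ram_gb, math.ceil(ram_gb / 5.5)))
# ~14.9 GB -> 3 EC2 Large instances instead of 5
```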

Overall you are correct that the current software requires all keys and their metadata to be in RAM at all times. There are very good reasons for doing this, as it leads to extremely fast lookup times not only for data that we do have, but also for items that we do not have (a "miss"). Rather than spending tens of milliseconds (or more) scanning an index on disk, we can return these requests in under a millisecond.

Of course there are resource costs to this, but that is how Couchbase provides its very predictable, low latency.

One of the next major improvements to the software will be to relax this constraint and allow more of this information to reside on disk. This will have to be a configurable setting, since the expectation of a memcached-like system is that it provides the response times that we do.

I can tell you of many Couchbase users that store well over 1B items in their clusters. Yes, it takes money and resources to do so, but the idea is that they use the scale and performance to make even more money with their particular application. It's also worth saying that Couchbase is certainly not a hammer for every nail; there will be applications that do not need the performance it provides, and other solutions may fit them better. It's up to each individual application to evaluate the tradeoffs between performance, simplicity, functionality and cost (to name a few).

Hope that provides some better context.

Perry