Limit on the number of documents

Hi,

I left a comment on a similar post, but that post was never answered, so I decided to open a new question.

It seems that Couchbase has a very frustrating limitation on the number of documents that can be stored in a node/cluster, depending on the total memory.

Here are my calculations. Taking an EC2 Large instance as a base and using the formula provided in the CB2 manual, I get the following:

Preliminary data:
1. An EC2 Large instance has 7.5 GB of memory, and only up to about 5.5 GB may be dedicated to a bucket. So we use 5.5 GB.
2. Suppose we don't have a replica (no_of_copies = 1 + number_of_replicas (0) = 1).
3. Document key length is 20 bytes (ID_size = 20).
4. Metadata is 120 bytes. (This is the space that Couchbase needs to keep metadata per document. All the document keys and their metadata need to live in memory at all times and should take no more than 50% of the memory dedicated to a bucket.)
5. Suppose the intended number of documents is 100M.
6. For simplicity, we do not even take into account the total size of the documents on disk.

Memory needed = (documents_num) * (metadata_per_document + ID_size) * (no_of_copies)

100,000,000 * (120 + 20) * 1 = 14,000,000,000 bytes ≈ 13 GB
13 GB * 2 (metadata may use only 50% of memory) = 26 GB (memory needed to operate a 100M-document bucket)
26 GB / 5.5 GB ≈ 4.73, rounded up to 5 EC2 Large instances
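
The arithmetic above can be reproduced with a short Python sketch. The function and parameter names are made up for illustration and are not part of any Couchbase tool. It assumes that "GB" means 2^30 bytes and that the 50% figure from the manual is applied as a strict limit.

```python
import math

GIB = 2 ** 30  # assumption: "GB" in the post means 2^30 bytes


def nodes_needed(num_docs, metadata_bytes, key_bytes, copies,
                 bucket_quota_gib, metadata_ratio=0.5):
    """Estimate how many nodes are needed to keep all keys and metadata in RAM.

    metadata_ratio is the share of the bucket quota that keys and metadata
    are allowed to occupy (0.5 for the 50% figure used in the post).
    """
    # Memory needed = documents_num * (metadata_per_document + ID_size) * no_of_copies
    metadata_total = num_docs * (metadata_bytes + key_bytes) * copies
    # Total RAM required so that metadata stays within the allowed share
    ram_needed = metadata_total / metadata_ratio
    # Round up: a fraction of a node still means one more node
    nodes = math.ceil(ram_needed / (bucket_quota_gib * GIB))
    return metadata_total / GIB, ram_needed / GIB, nodes


meta_gib, ram_gib, nodes = nodes_needed(
    num_docs=100_000_000, metadata_bytes=120, key_bytes=20,
    copies=1, bucket_quota_gib=5.5)
print(f"metadata: {meta_gib:.1f} GiB, RAM: {ram_gib:.1f} GiB, nodes: {nodes}")
```

With the inputs from the list above this gives roughly 13 GiB of metadata, 26 GiB of RAM, and 5 nodes. Setting copies=2 models one replica.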

So, to store 100M documents on EC2 Large instances we need at least 5 of them. That number of documents is nothing for a reasonably serious project, yet the cost of the servers (not to mention Couchbase support license fees) would be prohibitive.

And if we want to have a replica, then that number doubles.

These calculations are very rough. Using the full, more complicated method would not change the result significantly, and might even make it worse.

I really hope that my calculations are wrong, and I look forward to being disproved, because I like the product very much for a number of reasons and would really like to use it in our projects.

Don't worry, you can have more documents than you can fit in memory. Only the most recently set or retrieved documents (the working set) will be in memory; the rest of the items will be on disk. So your limit is the hard drive.

I would recommend that you use only about 25-30% of your hard drive to store the documents. You have to take into account compaction and cbbackup (making dumps of your documents).

househippo, thank you very much for your comment. Indeed, the 25-30% hard drive recommendation may be a good one (Y).
But I think you are confusing the documents with the documents' metadata. All metadata must reside in memory, while the document bodies may partly reside only on disk.

1 Answer

The amount of metadata per item is actually closer to 60 bytes. In previous versions it was higher, but the latest manual should have the right numbers (http://www.couchbase.com/docs/couchbase-devguide-2.0/more-on-metadata.html). I would also say that keeping the metadata under 50% of RAM is not a hard requirement, but rather a guideline to ensure you still have enough space to get the benefit of caching the actual document values.
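
To see how much the lower figure changes the estimate from the question, here is a rough Python sketch that reruns the same formula with about 60 bytes of metadata per item. The 20-byte key, the 5.5 GB bucket quota, and the 50% guideline are carried over from the question. The function name is made up for illustration, and "GB" is assumed to mean 2^30 bytes.

```python
import math

GIB = 2 ** 30  # assumption: "GB" means 2^30 bytes


def estimate(num_docs, metadata_bytes, key_bytes=20, copies=1,
             quota_gib=5.5, metadata_ratio=0.5):
    """Return metadata size, RAM needed, and node count for a bucket."""
    metadata_total = num_docs * (metadata_bytes + key_bytes) * copies
    # metadata_ratio is a guideline, not a hard limit, so it is a parameter
    ram_needed = metadata_total / metadata_ratio
    return {
        "metadata_gib": metadata_total / GIB,
        "ram_gib": ram_needed / GIB,
        "nodes": math.ceil(ram_needed / (quota_gib * GIB)),
    }


# Compare the figure used in the question with the lower one
for meta in (120, 60):
    result = estimate(100_000_000, meta)
    print(f"{meta} bytes/item: {result['metadata_gib']:.1f} GiB metadata, "
          f"{result['nodes']} nodes")
```

Under these assumptions, 100M items need about 7.5 GiB for keys and metadata instead of about 13 GiB, which brings the estimate down from 5 nodes to 3.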

Overall, you are correct that the current software requires all keys and their metadata to be in RAM at all times. There are very good reasons for doing this: it leads to extremely fast lookup times, not only for data that we do have, but also for items that we do not have (a "miss"). Rather than spending tens of milliseconds (or more) scanning an index on disk, we can answer these requests in under 1 ms.

Of course there are resource costs to this, but that is how Couchbase provides its very predictable and low latency.

One of the next major improvements to the software will be to relax this constraint and allow more of this information to reside on disk. This will have to be a configurable setting, as the expectation of a memcached-like system is that it can provide the response times that we do.

I can tell you of many Couchbase users who actually store well over 1B items in their cluster. Yes, it takes money and resources to do so, but the idea is that they use the scale and performance to make even more money with their particular application. It's also worth saying that Couchbase is certainly not a hammer for all nails, and there will be applications that do not need the performance it provides, so other solutions may fit them better. It's up to each individual application to evaluate the tradeoffs between performance, simplicity, functionality and cost (to name a few).

Hope that provides some better context.

Perry

Perry, your answer is very helpful and "hopefulness-giving" (if there is such a word in English :) ). Thank you very much!

True, good things come at a price, and I agree with that. My concern was that the limitation I described above might be excessive compared with the overall price, but your answer clarified a lot, mostly regarding the smaller metadata size, the configurable setting, and the future improvements.

So far I have had 2-3 cases where Couchbase reached its 50% limit, and what I got was numerous log entries like: "Metadata overhead warning. Over 50% of RAM allocated to bucket "XXX" on node "XXXX" is taken up by keys and metadata. (repeated 19 times) ".
They keep coming every 10-20 seconds and fill up the whole web log page. I am talking about version 2.0.1.

Is there any way to stop these log entries, or is there a config value that raises the threshold?
Or maybe the new 2.1 version already improves on this?

Thank you again!

I believe the overzealous messaging has been made much better in 2.1. Could you give that a try and report back?