Keys, Sizing, Large Dataset
I went through the sizing guidelines at http://wiki.membase.org/display/membase/Sizing+Guidelines, using our parameters-du-jour:
Number of keys: 1 billion
Key size: 128 bytes (i.e. string UUID)
Value size: 4K
Replicas: 2
Working Set Percentage: 1%
Per Node RAM Quota: 10GB
I popped those numbers into the formulas and came up with... 150 machines(!)
I checked the math a few times, and realized that this is because the sizing assumes that all keys + metadata must be in memory on some node in the cluster (or on more than one node, depending on replicas). So in my case, I require (128 + 120 (key overhead)) bytes X 3 copies X 1 billion keys, which comes out to about 700GB of memory required across the cluster -- just for the keys.
If I set my key size to 1 byte, my replication to zero, and my working set percentage to zero, then (according to the sizing guidelines) I still require 209 GB of in-memory space (across the cluster) for the overhead on 1 Billion keys.
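For reference, the key-metadata arithmetic above can be sketched in a few lines of Python. This is only a rough sketch: the function name is made up for illustration, the 120-byte per-key overhead is the figure from the sizing guidelines, and it deliberately leaves out any headroom or other factors the guidelines add on top (presumably why the 1-byte case comes out at 121 GB raw here rather than the 209 GB the full formulas give).

```python
GB = 10**9  # decimal gigabytes (an assumption; in GiB the first figure is ~693)

def metadata_bytes(num_keys, key_size, replicas, overhead_per_key=120):
    """Raw RAM needed for keys plus per-key metadata, across all copies.

    Hypothetical helper; overhead_per_key=120 is the per-key overhead
    quoted from the sizing guidelines.
    """
    copies = 1 + replicas  # the active copy plus each replica
    return (key_size + overhead_per_key) * copies * num_keys

# My actual parameters: 1 billion 128-byte keys, 2 replicas.
full = metadata_bytes(num_keys=10**9, key_size=128, replicas=2)

# The stripped-down case: 1-byte keys, no replicas.
minimal = metadata_bytes(num_keys=10**9, key_size=1, replicas=0)

print(f"128-byte keys, 2 replicas: {full / GB:.0f} GB")    # 744 GB
print(f"1-byte keys, no replicas:  {minimal / GB:.0f} GB")  # 121 GB
```

Note that the working set percentage does not appear anywhere in this calculation, which is exactly the problem: the metadata cost is the same whether the data is hot or cold.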
Is it true that we need to reserve RAM for all of the metadata for all keys, retrieved or not? These two pages ( http://wiki.membase.org/display/membase/Memory+Quotas and http://wiki.membase.org/display/membase/Growing+Data+Sets+Beyond+Memory ) lead me to believe that we do -- or that each node will load the index of the data that it contains into RAM.
From an economic standpoint, that would mean that if one is going to have a billion items in storage, most of them had better be paying for themselves. That's not my situation: my use case is one where the working set is much, much smaller than my overall data set -- 0.1% is more like it. So most of my data is "cold", and is _unlikely_ ever to be retrieved again, so I don't want to have to burn valuable memory to point to that data. Do I have any options, from a Membase standpoint?
Cheers,
Paul
Paul, thanks for your inquiry. Your assumptions and calculations are all correct.
In order to provide the absolute best performance, Membase keeps all of the metadata and indices in RAM. One of the major advantages of Membase/memcached is the ability to very quickly tell the application that a piece of data DOESN'T exist, unlike a traditional database, which can take multiple seconds just to tell the application that it doesn't have the data you were asking for.
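To illustrate why keeping key metadata in RAM makes misses cheap, here is a toy Python model. It is not Membase's actual implementation: the class, its fields, and the dictionaries standing in for RAM and disk are all hypothetical, and it only shows the general idea that a lookup for a nonexistent key can be answered without touching disk.

```python
class ToyStore:
    """Toy model: key metadata always in RAM, values may live only on disk."""

    def __init__(self):
        self.meta = {}       # key -> value size; stands in for in-RAM metadata
        self.cache = {}      # key -> value; the in-RAM working set
        self.disk = {}       # key -> value; stands in for slow persistent storage
        self.disk_reads = 0  # counts how often a lookup had to go to "disk"

    def set(self, key, value):
        self.meta[key] = len(value)  # metadata is recorded for every key
        self.disk[key] = value       # the value itself is persisted

    def get(self, key):
        if key not in self.meta:
            return None              # miss answered from RAM, no disk access
        if key in self.cache:
            return self.cache[key]   # hot item served from RAM
        self.disk_reads += 1         # cold item: fetch from disk, then cache it
        value = self.disk[key]
        self.cache[key] = value
        return value


store = ToyStore()
store.set("a", b"some value")
print(store.get("missing"), store.disk_reads)  # None 0
print(store.get("a"), store.disk_reads)        # b'some value' 1
```

The trade-off is visible in the model: `meta` grows with the total number of keys, not with the working set, which is where the RAM requirement comes from.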
We actually have a number of customers running with over 1 billion active keys. It does take a lot of RAM, but the performance benefits are well worth it.
We've got some higher-order action items to reduce the amount of overhead required, but I don't have a good timeframe for that at the moment.
In the end, Membase is focused on providing consistent and predictably low-latency access to your dataset. If that's not something that is important to you, Membase may not be the right solution.
Perry