Keys, Sizing, Large Dataset
I went through the sizing guidelines at http://wiki.membase.org/display/membase/Sizing+Guidelines, using our parameters-du-jour:
Number of keys: 1 billion
Key size: 128 bytes (i.e. a string UUID)
Value size: 4K
Working Set Percentage: 1%
Per Node RAM Quota: 10GB
I popped those numbers into the formulas and came up with... 150 machines(!)
I checked the math a few times, and realized that this is because the sizing assumes that all keys + metadata must be in memory on some node in the cluster (or on more than one, depending on replicas). So in my case I require (128 bytes of key + 120 bytes of key overhead) x 3 copies x 1 billion keys, which comes out to about 700GB of memory required across the cluster -- just for the keys.
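Roughly, the arithmetic I did looks like the little sketch below (not the wiki's exact formula -- the 120-byte per-key overhead and the 3 copies are just the assumptions I listed above):

    # Back-of-the-envelope: cluster RAM needed just for keys + per-key metadata.
    keys          = 1_000_000_000   # 1 billion items
    key_size      = 128             # bytes (string UUID)
    meta_overhead = 120             # assumed per-key overhead from the sizing guidelines
    copies        = 3               # 1 active + 2 replicas (my assumption)

    total_bytes = (key_size + meta_overhead) * copies * keys
    print(f"{total_bytes / 2**30:.0f} GiB for keys + metadata")   # ~693 GiB
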
If I set my key size to 1 byte, my replication to zero, and my working set percentage to zero, then (according to the sizing guidelines) I still require 209 GB of in-memory space (across the cluster) just for the overhead on 1 billion keys.
Is it true that we need to reserve RAM for the metadata of all keys, whether they are ever retrieved or not? These two pages ( http://wiki.membase.org/display/membase/Memory+Quotas and http://wiki.membase.org/display/membase/Growing+Data+Sets+Beyond+Memory ) lead me to believe that we do -- that is, each node will load the index of the data it contains into RAM.
From an economic standpoint, that would mean that if one is going to keep a billion items in storage, most of them had better be paying for themselves. That's not my situation: my use case is one where the working set is much, much smaller than the overall data set -- 0.1% is more like it. Most of my data is "cold" and is _unlikely_ ever to be retrieved again, so I don't want to burn valuable memory pointing to it. Do I have any options, from a Membase standpoint?