Working Set
Before deciding how much memory the cluster needs, you should understand the concept of a 'working set'. The working set at any point in time is the data that your application actively uses. Ideally, your entire working set should live in memory.
Memory quota
It is very important that a Membase cluster is sized in accordance with the working set size and total data you expect.
The goal is to size the RAM available to Membase so that all your keys, the key metadata, and the working set values fit into memory in your cluster, just below the point at which Membase starts evicting values to disk (the High Water Mark).
How much memory and disk space you need per node depends on several variables, defined below.
Calculations are per bucket
The calculations below are per bucket and need to be summed across all buckets. If all your buckets have the same configuration, you can treat your total data as a single bucket; there is no per-bucket overhead to consider.
Inputs
Table 4.11. Input Variables
| Variable | Description |
|---|---|
| keys_num | The total number of keys you expect in your dataset |
| key_size | The average size of keys, in bytes |
| value_size | The average size of values, in bytes |
| number_of_replicas | The number of copies of the original data you want to keep |
| working_set_percentage | The percentage of your data you want in memory. |
| per_node_ram_quota | How much RAM can be assigned to Membase |
Constants
The following items are used in calculating the required memory and are assumed to be constant.
Table 4.12. Constants
| Constant | Description |
|---|---|
| Metadata per key (metadata_per_key) | The space that Membase needs to keep metadata for each key. It is 120 bytes. All the keys and their metadata need to live in memory at all times. |
| Type of storage (SSD or spinning disk) | SSDs give better I/O performance than spinning disks. |
| headroom_percentage | Typically 25% for SSDs and 30% for spinning disks, because SSDs are faster |
| High Water Mark percentage (high_water_mark_percentage) | By default, it is set at 70% of the memory allocated to the node |
The working set size is the portion of your total data that you want in memory. Use the following calculations as a rough guideline to size your cluster:
Table 4.13. Variables
| Variable | Calculation | Comments |
|---|---|---|
| no_of_copies | = 1 + number_of_replicas | |
| total_metadata | = (keys_num) * (metadata_per_key + key_size) * (no_of_copies) | All keys and their metadata need to live in memory |
| total_dataset | = (keys_num) * (value_size) * (no_of_copies) | |
| working_set | = total_dataset * (working_set_percentage) | |
| Cluster RAM quota required | = (total_metadata + working_set) * (1+headroom_percentage)/(high_water_mark_percentage) | |
| number of nodes | = Cluster RAM quota required / per_node_ram_quota | |
Number of Nodes
You will need at least number_of_replicas + 1 nodes, irrespective of your data size.
Example sizing calculation
Table 4.14. Input Variables
| Input Variable | value |
|---|---|
| keys_num | 1,000,000 |
| key_size | 100 |
| value_size | 10000 |
| number_of_replicas | 1 |
| working_set_percentage | 20% |
Table 4.15. Constants
| Constants | value |
|---|---|
| Type of Storage | SSD |
| headroom_percentage | 25% |
| metadata_per_key | 120 |
| high_water_mark_percentage | 70% |
Table 4.16. Variable Calculations
| Variable | Calculation | Description |
|---|---|---|
| no_of_copies | = 2 | 1 for original and 1 for replica |
| total_metadata | = 1,000,000 * (100 + 120) * (2) = 440,000,000 | |
| total_dataset | = 1,000,000 * (10,000) * (2) = 20,000,000,000 | |
| working_set | = 20,000,000,000 * (0.2) = 4,000,000,000 | |
| Cluster RAM quota required | = (440,000,000 + 4,000,000,000) * (1 + 0.25) / (0.7) ≈ 7,928,571,429 | |
If you have 8 GB machines and you want to use 6 GB for Membase:
number of nodes = Cluster RAM quota required / per_node_ram_quota = 7.9 GB / 6 GB ≈ 1.3, which rounds up to 2 nodes
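The formulas and the worked example above can be sketched in Python as follows. This is only an illustration of the arithmetic, not a Membase tool: the names `BucketInputs`, `cluster_ram_quota`, and `number_of_nodes` are hypothetical, percentages are passed as fractions, and 6 GB is assumed to mean 6 × 10^9 bytes, matching the decimal rounding used in the example.

```python
import math
from dataclasses import dataclass

METADATA_PER_KEY = 120  # bytes per key, from the constants table


@dataclass
class BucketInputs:
    """Per-bucket sizing inputs (hypothetical container for this sketch)."""
    keys_num: int
    key_size: int                   # average key size in bytes
    value_size: int                 # average value size in bytes
    number_of_replicas: int
    working_set_percentage: float   # fraction, e.g. 0.20 for 20%
    headroom_percentage: float = 0.25         # 0.25 for SSD, 0.30 for spinning disks
    high_water_mark_percentage: float = 0.70  # default high water mark


def cluster_ram_quota(b: BucketInputs) -> float:
    """Return the cluster RAM quota required for one bucket, in bytes."""
    no_of_copies = 1 + b.number_of_replicas
    total_metadata = b.keys_num * (METADATA_PER_KEY + b.key_size) * no_of_copies
    total_dataset = b.keys_num * b.value_size * no_of_copies
    working_set = total_dataset * b.working_set_percentage
    return ((total_metadata + working_set)
            * (1 + b.headroom_percentage)
            / b.high_water_mark_percentage)


def number_of_nodes(quota: float, per_node_ram_quota: float,
                    number_of_replicas: int) -> int:
    """Round up to whole nodes and enforce the replicas + 1 minimum."""
    by_ram = math.ceil(quota / per_node_ram_quota)
    return max(by_ram, number_of_replicas + 1)


# Inputs from the example sizing calculation
example = BucketInputs(keys_num=1_000_000, key_size=100, value_size=10_000,
                       number_of_replicas=1, working_set_percentage=0.20)
quota = cluster_ram_quota(example)
nodes = number_of_nodes(quota, 6 * 10**9, example.number_of_replicas)
print(f"{quota:,.0f} bytes -> {nodes} nodes")
```

With the example inputs this gives a quota of roughly 7.9 GB and 2 nodes. For several buckets with different configurations, the per-bucket quotas would be summed before dividing by the per-node quota.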
RAM quota
You cannot allocate all of your machine's RAM to per_node_ram_quota, because other programs may be running on the machine.
Disk space
Disk space is required to persist data. How much disk space you should plan for depends on how your data grows. You will also want to store backup data on the system. A good guideline is to plan for at least 130% of the total data you expect: 100% for data backup and 30% for overhead during file maintenance.
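As a rough illustration of the 130% guideline, the sketch below computes a minimum disk budget in bytes. The function name `disk_space_required` is hypothetical, and the choice of what counts as "total data" is an assumption: here it is taken to be total_dataset from the example, which already includes replica copies.

```python
def disk_space_required(total_data_bytes: int, overhead_percent: int = 30) -> int:
    """Return the minimum disk space to plan for, in bytes.

    Applies the guideline of 100% for the data plus overhead_percent
    (30% by default) for file maintenance, rounding up to a whole byte.
    """
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return (total_data_bytes * (100 + overhead_percent) + 99) // 100


# Assumption: total data = total_dataset from the example (20,000,000,000 bytes)
disk_bytes = disk_space_required(20_000_000_000)
print(f"{disk_bytes:,} bytes")  # prints "26,000,000,000 bytes"
```

Because the guideline depends on how your data grows, the input should be the data volume you expect over the planning period rather than the current size.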