Growing/Shrinking Your Cluster
Over the lifetime of your Membase cluster, you will very likely need to add or remove nodes. The decision can be driven by a variety of factors, including RAM and disk capacity and available network bandwidth. Adding or removing nodes is done through a rebalancing process, which at its core is designed to keep data continuously available.
The same process can be used to perform software upgrades and hardware refreshes by removing a node, making changes, and then rebalancing it back into the cluster. If possible, it is a best practice for these operations to add the additional nodes before removing any nodes.
Growing your cluster
There are a few metrics on which to base your decision to grow a Membase cluster:
Increasing RAM capacity. Arguably the single most important component in a Membase cluster is the amount of available RAM. While filling up available RAM is not necessarily cause to grow a cluster, it should certainly be taken into consideration.
If you see more disk fetches occurring (also evidenced by a growing "cache miss ratio"), your application is requesting more and more data that is not available in RAM and must be read from disk. Increasing the RAM in a cluster will allow it to keep more data in memory and therefore provide better performance to your application.
If you want to add more buckets to your Membase cluster you will likely need more RAM to do so. Adding nodes will increase the overall capacity of the system and then you can shrink any existing buckets in order to make room for new ones.
Increasing disk IO throughput. By adding nodes to a Membase cluster, you will increase the aggregate amount of disk IO that can be performed. This is especially important in high-write environments, but can also be a factor when you need to read large amounts of data from the disk.
Increasing disk capacity. This one should go without saying. You can either add more disk space to your current nodes or add more nodes to add aggregate disk space to the cluster.
Network bandwidth. If you see that you are saturating, or are close to saturating, the network bandwidth of your cluster, this is a very strong indicator of the need for more nodes. More nodes will spread the network traffic across more machines, which will reduce the bandwidth consumed on each node individually.
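The metrics above can be combined into a simple check. The following is a minimal sketch, not part of Membase: the `ClusterSample` fields, the `growth_signals` helper, and all threshold values are hypothetical and would need to be fed from whatever statistics you collect and tuned for your own workload.

```python
from dataclasses import dataclass


@dataclass
class ClusterSample:
    """Hypothetical snapshot of counters gathered over one interval."""
    gets: int                          # total read operations
    disk_fetches: int                  # reads that had to be served from disk
    disk_used_bytes: int
    disk_total_bytes: int
    net_bytes_per_sec: float           # observed per-node throughput
    net_capacity_bytes_per_sec: float  # what the network link can sustain


def cache_miss_ratio(sample):
    """Fraction of reads that were served from disk rather than RAM."""
    if sample.gets == 0:
        return 0.0
    return sample.disk_fetches / sample.gets


def growth_signals(sample, miss_limit=0.05, disk_limit=0.80, net_limit=0.80):
    """Return the names of the metrics that exceed their limits.

    The limits are illustrative assumptions, not Membase defaults.
    """
    signals = []
    if cache_miss_ratio(sample) > miss_limit:
        signals.append("ram")
    if sample.disk_used_bytes / sample.disk_total_bytes > disk_limit:
        signals.append("disk_capacity")
    if sample.net_bytes_per_sec / sample.net_capacity_bytes_per_sec > net_limit:
        signals.append("network")
    return signals
```

An empty result does not prove the cluster is correctly sized; it only means none of the chosen limits has been crossed in that sample.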
Shrinking your cluster
Choosing to shrink a Membase cluster is a more subjective decision. It is usually based upon cost considerations and/or not needing as large a cluster to support the load requirements of your application.
When deciding to shrink a cluster, it is extremely important to ensure you have enough capacity in the remaining nodes to support your dataset as well as your application load.
It is also recommended that you remove nodes one at a time rather than several at once, so that you can understand the impact of each removal on the cluster as a whole.
It is recommended that you remove a node rather than fail it over. When a node fails and is not coming back to the cluster, the failover functionality will promote its replica vBuckets to become active immediately. If a healthy node is failed over, there might be some data loss for the replication data that was in flight during that operation. Using the remove functionality will ensure that all data is properly replicated and continuously available.
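The capacity check described above can be sketched as a small calculation before removing a node. This is an illustration only: the `can_remove_node` helper, the node names, and the 20% headroom figure are assumptions, not Membase settings.

```python
GIB = 1024 ** 3


def can_remove_node(node_ram_quotas, node_to_remove, required_ram_bytes, headroom=0.2):
    """Return True if the remaining nodes can hold the data with headroom.

    node_ram_quotas: dict mapping node name -> RAM quota in bytes
    required_ram_bytes: RAM the dataset needs to keep its working set in memory
    headroom: fraction of the remaining RAM to keep free (assumed value)
    """
    if node_to_remove not in node_ram_quotas:
        raise KeyError(node_to_remove)
    remaining = sum(
        quota for name, quota in node_ram_quotas.items() if name != node_to_remove
    )
    usable = remaining * (1.0 - headroom)
    return required_ram_bytes <= usable


# Example: three 8 GiB nodes, checking whether one can be removed.
quotas = {"node-a": 8 * GIB, "node-b": 8 * GIB, "node-c": 8 * GIB}
print(can_remove_node(quotas, "node-c", 12 * GIB))
print(can_remove_node(quotas, "node-c", 14 * GIB))
```

A similar check would be needed for disk space and for the request load the remaining nodes must serve; RAM is shown here only because it is the simplest to express.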
Once you decide to add nodes to or remove nodes from your Membase cluster, there are a few things to take into consideration:
If you're planning on adding and/or removing multiple nodes in a short period of time, it is best to add them all at once and then kick off the rebalancing operation rather than rebalance after each addition. This will reduce the overall load placed on the system as well as the amount of data that needs to be moved.
Choose a quiet time for adding nodes. While the rebalancing operation is meant to be performed online, it is not a "free" operation and will undoubtedly put increased load on the system as a whole in the form of disk IO, network bandwidth, CPU resources and RAM usage.
It is our recommended best practice to do any "voluntary" (i.e., not to resolve a failure) rebalancing operation during a period of low usage of the system. With today's 24x7 web applications, there may not be a period of complete inactivity, but it is up to the administrator to understand the impact that a rebalancing operation may have on the cluster and application.
Memory required for rebalancing. The rebalancing operation requires moving large amounts of data around the cluster. The more RAM that is available, the more disk access the operating system can cache, which allows the rebalancing operation to complete much faster. If there is not enough memory in your cluster, the rebalancing may be quite slow. It is recommended that you don't wait for your cluster to reach full capacity before adding capacity and rebalancing.
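To illustrate why adding several nodes at once and rebalancing a single time moves less data than rebalancing after each addition, here is a rough sketch under an idealized model. It assumes data is spread evenly across nodes and that each rebalance moves only the minimum needed to fill the new nodes; it ignores replicas and the details of how Membase actually places vBuckets, so treat the numbers as a lower bound for comparison only. The function names are hypothetical.

```python
from fractions import Fraction


def moved_fraction_single_rebalance(current_nodes, added_nodes):
    """Fraction of the dataset moved when all new nodes join in one rebalance."""
    return Fraction(added_nodes, current_nodes + added_nodes)


def moved_fraction_stepwise(current_nodes, added_nodes):
    """Total fraction moved when rebalancing after each single addition."""
    total = Fraction(0)
    for step in range(1, added_nodes + 1):
        # After this step the cluster has current_nodes + step nodes, and the
        # newest node must receive an equal share of the data.
        total += Fraction(1, current_nodes + step)
    return total


if __name__ == "__main__":
    print(moved_fraction_single_rebalance(4, 2))
    print(moved_fraction_stepwise(4, 2))
```

For a cluster growing from 4 to 6 nodes, this model gives 1/3 of the data moved with a single rebalance versus 11/30 with two separate rebalances, before counting the fixed overhead of running the operation twice.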