Membase Scalability
I was playing with Membase and now I have some doubts about its scalability. Let's take a simple use case. Suppose I already have two nodes (joined as a cluster), each with a 1 GB Membase RAM quota and each holding some data.
Suppose I add one new node (joining the cluster) with a 1 GB RAM quota. I have a few questions:
(1) Can I get better performance with 3 nodes compared with performance of 2 nodes?
(2) Does scaling out only give the advantage of a larger total RAM quota (here, 3 GB with 3 nodes instead of 2 GB with 2 nodes), so that I can store more data?
Please reply.
Hi Prasad,
As Steve mentions, customers can attest to Membase scalability and performance.
So I am wondering what experiments you ran that made you feel it wasn't scalable.
How did you try and measure scalability?
Are you using a smart-client? Are you using client side moxi?
Did you have many clients accessing the data in parallel? That is where a cluster is typically used, since lots of users access data at the same time.
Cheers,
Frank
Dear Steve, thanks for the reply. I will explain my questions with a simple use case. Scenario: suppose I have 1 node with a 1 GB Membase RAM quota, holding data that fills more than 50% of that RAM:
Nodes    | RAM  | Total RAM (whole cluster) | Cached data size | Free RAM
---------|------|---------------------------|------------------|------------------------
1st node | 1 GB | 1 GB                      | 700 MB           | 324 MB (1 GB - 700 MB)
Suppose I add one new node with the same capacity to the existing cluster above. If my understanding is correct, replication kicks in and the same data is also stored on the newly added node. That means:
Nodes    | RAM  | Total RAM (whole cluster) | Cached data size         | Free RAM
---------|------|---------------------------|--------------------------|---------
1st node | 1 GB | 2 GB                      | 700 MB                   | 324 MB
2nd node | 1 GB | 2 GB                      | 700 MB (replicated data) | 324 MB
So my questions are:
(1) Can I get the benefit of the extra RAM and disk space of the 2nd node in this case? The data of the 1st node is already replicated on the 2nd node, so in both cases only 324 MB of RAM can be used for caching other data. What is the advantage of scaling out here?
(2) Does adding more and more nodes improve performance or not? How can that be shown? Can you please give simple use cases that demonstrate that adding nodes to an existing cluster improves performance?
Dear Frank,
Please find the reply inline.
>How did you try and measure scalability?
I am measuring with the following steps:
1) Add more and more nodes to the existing nodes (joining them to the cluster).
2) Rebalance the cluster.
3) Compare performance of the existing nodes against the cluster with the added nodes.
4) Compare storage (RAM space) capacity of the existing nodes against the cluster with the added nodes.
>Are you using a smart-client? Are you using client side moxi?
I am using the Enyim 2.8 .NET client.
>Did you have many clients try accessing the data in parallel (which is where a cluster is typically used, as you have lots of users access data at the same time).
Yes, in our scenario multiple clients will access the data in parallel.
Hi Prasad,
> So my questions are: (1) Can I get the benefit of the extra RAM and disk space of the 2nd node in this case? The data of the 1st node is already replicated on the 2nd node, so in both cases only 324 MB of RAM can be used for caching other data. What is the advantage of scaling out here?
Yes, the extra node gives you a few benefits...
* First off, apologies if I wasn't clear on this, but you'll now have half of your "shards" (we call them vbuckets) owned by node 1 as the primary for those vbuckets. The other half of the vbuckets are owned by node 2 as the primary. So you've doubled the CPU, network, disk I/O and RAM available for use in the cluster. This is especially easy to think about in the case when you have no replication configured (replica count == 0). Clients (like your Enyim client) will hash their keys and directly contact the correct node to operate on the requested data item.
* Next, with replication configured (say, a replica count of 1) and 2 nodes in the cluster, replication will actually be working for you now, compared to a cluster of just a single node. If something happens to a single node, you won't have the catastrophic issue of 100% data loss.
* The RAM used for replica items often doesn't use the same amount of RAM as for primary items. By that, I mean that after replica items are saved to disk, their memory can be freed for other usage (like for primary items on that node).
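To make the vbucket idea concrete, here is a minimal Python sketch of how a client could map keys to nodes. It is only an illustration: the vbucket count of 1024, the CRC32 hash, and the round-robin assignment in `build_vbucket_map` are assumptions made for this example, and all helper names are hypothetical. A real smart client obtains the vbucket map from the cluster instead of computing it.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed vbucket count, for illustration only


def build_vbucket_map(nodes, num_vbuckets=NUM_VBUCKETS):
    """Assign each vbucket a primary node, round-robin.

    Illustrative placement only; the real cluster decides ownership.
    """
    return [nodes[i % len(nodes)] for i in range(num_vbuckets)]


def vbucket_for_key(key, num_vbuckets=NUM_VBUCKETS):
    """Hash a key to a vbucket id (CRC32 used as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def node_for_key(key, vbucket_map):
    """Look up the primary node that owns the key's vbucket."""
    return vbucket_map[vbucket_for_key(key, len(vbucket_map))]


def count_keys_per_node(keys, vbucket_map):
    """Count how many of the given keys land on each node."""
    counts = {}
    for key in keys:
        node = node_for_key(key, vbucket_map)
        counts[node] = counts.get(node, 0) + 1
    return counts


if __name__ == "__main__":
    vmap = build_vbucket_map(["node1", "node2"])
    keys = ["user:%d" % i for i in range(10000)]
    print(count_keys_per_node(keys, vmap))
```

With two nodes, each one is primary for half of the vbuckets, so roughly half of the keys (and of the client traffic) go to each node.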
> (2) Does adding more and more nodes improve performance or not? How can that be shown? Can you please give simple use cases that demonstrate that adding nodes to an existing cluster improves performance?
There are load generator tools available. /opt/membase/bin/memcachetest, for example, is one of those and ships with the latest Membase versions.
The idea is to generate load so that you aren't limited by the client or the network (for example, by a single-threaded FOR-loop in your favorite language on one client machine, or by testing from some oversubscribed cafe WiFi hotspot), so that you're really testing Membase node performance. A properly configured memcachetest with multiple threads should do it (sometimes you need more than one client machine running memcachetest). Once you have the numbers for a single node, start adding more nodes and see whether life is better.
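As a rough illustration of driving load from several threads instead of a single loop, here is a small Python harness. It is a sketch under stated assumptions, not a replacement for memcachetest: `run_load` and `FakeStore` are hypothetical names, and `FakeStore` is an in-memory stand-in so the example runs without a cluster. To measure a real cluster you would replace the body of `op` with calls to your actual client library. Note that a Python client can itself become the bottleneck, which is exactly the client-side limit described above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(op, num_threads, ops_per_thread):
    """Call op(thread_id, i) from several threads.

    Returns (total_ops, elapsed_seconds).
    """
    def worker(thread_id):
        for i in range(ops_per_thread):
            op(thread_id, i)
        return ops_per_thread

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total = sum(pool.map(worker, range(num_threads)))
    elapsed = time.perf_counter() - start
    return total, elapsed


class FakeStore:
    """In-memory stand-in for a cache client (hypothetical, not a Membase API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


if __name__ == "__main__":
    store = FakeStore()

    def op(thread_id, i):
        # Replace these two calls with real client operations.
        key = "t%d:k%d" % (thread_id, i)
        store.set(key, b"x" * 100)
        store.get(key)

    total, elapsed = run_load(op, num_threads=4, ops_per_thread=5000)
    print("%d ops in %.2fs (%.0f ops/s)" % (total, elapsed, total / elapsed))
```

Running the same harness against one node, then two, then three, and comparing operations per second, is one simple way to see whether added nodes help.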
Cheers,
Steve
@Steve, Thank you for your valuable comments.
>* The RAM used for replica items often doesn't use the same amount of RAM as for primary items.
If that is the case, could you please suggest how I can confirm/verify this via the Membase web console or through the Enyim client?
> By that, I mean that after replica items are saved to disk, their memory can be freed for other usage (like for primary items on that node).
Ok. Going back to the simple use case I mentioned above, I think I should be able to reuse the 1 GB of RAM on the 2nd (newly added) node, because as of now I get only 324 MB of unused RAM from it.
To achieve that:
How can I save the 700 MB of data replicated from the 1st node to disk?
Is it possible with
(a) Configuration Change (Changing the existing Membase configuration settings) ?
(b) Usage of Code ( Properly using Enyim client library ) ?
Could you please suggest methods to reuse RAM space when scaling out?
Regards
Prasad
> If that is the case, could you please suggest how I can confirm/verify this via the Membase web console or through the Enyim client?
Hi Prasad,
Via the Membase web console, click on MONITOR / Data Buckets, then look at the "vbucket resources" section. You'll find the "resident %" for Active and Replica items.
Please note, by the way, that if Membase has extra RAM available, it will keep items in RAM, both active items and replica items.
> > By that, I mean that after replica items are saved to disk, their memory can be freed for other usage (like for primary items on that node).
>
> Ok. Going back to the simple use case I mentioned above,
> I think I should be able to reuse the 1 GB of RAM on the 2nd (newly added) node, because as of now I get only 324 MB of unused RAM from it.
> To achieve that:
> How can I save the 700 MB of data replicated from the 1st node to disk?
One thing that I'm probably not explaining clearly is that every node in your cluster will become a primary owner of some of the vbuckets (or shards). Here's a description of the vbuckets sharding approach that I like a lot...
http://dustin.github.com/2010/06/29/memcached-vbuckets.html
> Is it possible with
> (a) Configuration Change (Changing the existing Membase configuration settings) ?
> (b) Usage of Code ( Properly using Enyim client library ) ?
This is handled automatically by the Membase software and is not directly controllable by API or configuration. If you have unused RAM quota available, Membase will try to use it to keep as much data as possible cached in memory for the highest performance.
This is usually what you want, and if you run out of RAM, simply add another node to your cluster and rebalance. The vbuckets or shards will then be evenly spread across your nodes.
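For the 700 MB example discussed in this thread, the effect of an even spread can be estimated with some back-of-the-envelope arithmetic. The sketch below assumes a perfectly even distribution, ignores per-item metadata overhead, and assumes replicas only exist when there is another node to hold them. The function name `per_node_data_mb` is hypothetical.

```python
def per_node_data_mb(active_data_mb, num_nodes, replica_count):
    """Estimate (active_mb, replica_mb) held by each node.

    Assumes an even spread of vbuckets and no metadata overhead.
    """
    # A replica must live on a different node than its primary,
    # so a single node cannot hold any replicas of its own data.
    effective_replicas = min(replica_count, num_nodes - 1)
    active = active_data_mb / num_nodes
    replica = active_data_mb * effective_replicas / num_nodes
    return active, replica


if __name__ == "__main__":
    for nodes in (1, 2, 3):
        active, replica = per_node_data_mb(700, nodes, replica_count=1)
        print("%d node(s): %.0f MB active + %.0f MB replica per node"
              % (nodes, active, replica))
```

Under these assumptions, with a replica count of 1, two nodes each hold about 350 MB of active data plus 350 MB of replica data, and three nodes about 233 MB of each. The replica figure is an upper bound on RAM use, since replica items can be freed from memory once they are saved to disk.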
> Could you please suggest methods to reuse RAM space when scaling out?
The goal of utilizing RAM as efficiently as possible is very important. If your application's working set doesn't fit in your cluster's RAM quota, performance can drop drastically: Membase will automatically handle retrieving items from and saving items to disk for your requests, but your application's requests could then be gated by disk I/O performance instead of RAM performance.
To manage working set sizes and reuse RAM space efficiently, some techniques that you might consider include using item data compression; actively & explicitly deleting unused items; and using expirations to have Membase automatically reclaim space used by expired items.
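One of these techniques, item data compression, can be sketched on the client side as follows. This is a minimal example under assumptions: JSON serialization and zlib are choices made for illustration, and `pack` and `unpack` are hypothetical helper names. The compressed bytes would be what you pass to your client's set call, and `unpack` would be applied to what the get call returns.

```python
import json
import zlib


def pack(value):
    """Serialize a value to JSON and compress it before storing."""
    raw = json.dumps(value).encode("utf-8")
    return zlib.compress(raw)


def unpack(blob):
    """Decompress and deserialize a value read back from the store."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    item = {"user": "prasad", "tags": ["membase"] * 50}
    raw_size = len(json.dumps(item).encode("utf-8"))
    packed = pack(item)
    print("raw: %d bytes, compressed: %d bytes" % (raw_size, len(packed)))
    assert unpack(packed) == item
```

Compression trades client CPU for RAM, and it helps most with large, repetitive values. Very small values may not shrink at all, so it is worth measuring on your own data.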
Finally, we've found that adding more nodes to your cluster is often the fastest "ops centric" approach (read: no code changes needed or need to wake up the app developers) to handle growing popularity and applications that are succeeding at internet speeds.
Cheers,
Steve
Hi Prasad,
> I was playing with Membase and now I have some doubts regarding its scalability.
Membase is deployed in production at some of the highest-scalability websites, social networks/games, online ad networks, etc. on the planet. Many folks have found it to be the right technology for their problems.
> (1) Can I get better performance with 3 nodes compared with performance of 2 nodes?
Generally, yes. Membase supports linear scalability, so as each node gets busy, the ability to just add more nodes to a Membase cluster to scale out is a key feature.
> (2) Does scaling out only give the advantage of a larger total RAM quota (here, 3 GB with 3 nodes instead of 2 GB with 2 nodes), so that I can store more data?
The data that Membase can store is limited by disk, not by RAM. However...
The ability to have more of your application's working set in RAM is by far and away (much emphasis here) the top consideration you should have in mind, and you'll want to take care to get these configuration and sizing numbers right. Secondarily, having good disk I/O capabilities on your machines and (with the ability to add more machines) being able to scale out your disk I/O is also important. That is, as a rough rule of thumb, running a huge number of nodes where each node has weakling memory and disks is worse than running fewer but more capable nodes that each have decent RAM and disk capability.
Other considerations (like network I/O and CPU) should be kept in mind, too, but are usually not the primary drivers in production Membase deployments.
These are good questions!
Cheers,
Steve