Suitable use case?
I need to store ~1 billion chunks of ~4kb worth of data (i.e. about 4TB when fully deployed, will add data incrementally). Need real-time access. The data will be accessed fairly randomly, but sometimes sequentially: the chunks string together in series, with ~1 million series or so, sometimes I need to read things one series at a time, but usually just the last chunk in each series. Some chunks are much smaller than 4kb - they may be as small as 4 bytes (maybe 25% of them or so are significantly smaller than 4kb).
1. Is it a good idea to use membase for this type of scenario?
2. Does membase store the values efficiently more or less regardless of size? (i.e. is there a block-ness (= min size) to what gets stored? Overhead per item I assume is small?)
My plan was to use membase on EC2, small instances (1.7GB ram, 1.3GB usable by membase, 1TB EBS disk each), and add nodes as needed as I add data.
3. Has anyone deployed something similar? Any oops/gotchas with the setup outlined above? My assumption was that this will be severely disk (seek) limited, so going for higher CPU or bandwidth EC2 instances would be wasted?
Thanks!
I need to store ~1 billion chunks of ~4kb worth of data (i.e. about 4TB when fully deployed, will add data incrementally). Need real-time access. The data will be accessed fairly randomly, but sometimes sequentially: the chunks string together in series, with ~1 million series or so, sometimes I need to read things one series at a time, but usually just the last chunk in each series. Some chunks are much smaller than 4kb - they may be as small as 4 bytes (maybe 25% of them or so are significantly smaller than 4kb).
1. Is it a good idea to use membase for this type of scenario?
2. Does membase store the values efficiently more or less regardless of size? (i.e. is there a block-ness (= min size) to what gets stored? Overhead per item I assume is small?)
Membase is designed for random access to billions of small data items, and both the software and the data file format are optimized for small random accesses. All your data would be distributed throughout the cluster. Membase does not support sequential access out of the box, so you will have to implement that with application-level logic. If some values are really small, you can combine several of them into a single value.
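To make "application-level logic" concrete, here is a minimal sketch of one possible key scheme: each chunk is stored under a key built from its series id and position, and a small counter key per series records how many chunks exist. The `ChunkStore` class, the `s:<series>:<index>` key layout, and the dict-like `kv` object are all hypothetical names chosen for illustration; a plain Python dict stands in for the cluster, and a real client would replace the dict operations with its own get/set calls.

```python
class ChunkStore:
    """Hypothetical wrapper that layers series access on a key-value store.

    `kv` is any dict-like object; a plain dict stands in for the cluster here.
    """

    def __init__(self, kv):
        self.kv = kv

    def _count_key(self, series_id):
        return f"s:{series_id}:count"

    def _chunk_key(self, series_id, index):
        return f"s:{series_id}:{index}"

    def _count(self, series_id):
        # The counter key holds the number of chunks written so far.
        return int(self.kv.get(self._count_key(series_id), 0))

    def append(self, series_id, data):
        """Store `data` as the next chunk of the series and return its index.

        Not atomic: two concurrent writers to one series would need a
        compare-and-swap or increment operation from the real store.
        """
        index = self._count(series_id)
        self.kv[self._chunk_key(series_id, index)] = data
        self.kv[self._count_key(series_id)] = str(index + 1)
        return index

    def last(self, series_id):
        """Return the newest chunk of the series, or None if it is empty."""
        count = self._count(series_id)
        if count == 0:
            return None
        return self.kv[self._chunk_key(series_id, count - 1)]

    def read_series(self, series_id):
        """Yield every chunk of the series in order, one lookup per chunk."""
        for index in range(self._count(series_id)):
            yield self.kv[self._chunk_key(series_id, index)]


store = ChunkStore({})
store.append("sensor-7", b"first")
store.append("sensor-7", b"second")
newest = store.last("sensor-7")
everything = list(store.read_series("sensor-7"))
```

With this layout the common case (the last chunk of a series) costs two lookups, and reading a whole series costs one lookup per chunk. Note that the counter is an extra key per series, so it adds to the per-key overhead.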
There is a per key overhead of ~150 bytes.
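As a rough worked example of what that overhead means for the numbers in the question, the sketch below multiplies it out. It assumes exactly 150 bytes per key, 1 billion keys, and decimal units (4 kB = 4000 bytes); the function name and constants are made up for illustration.

```python
KEY_OVERHEAD_BYTES = 150          # "~150 bytes" from the answer, taken as exact
NUM_CHUNKS = 1_000_000_000        # "~1 billion chunks" from the question


def overhead_summary(num_items, value_bytes, key_overhead=KEY_OVERHEAD_BYTES):
    """Return (total overhead bytes, total payload bytes, overhead/payload)."""
    overhead = num_items * key_overhead
    payload = num_items * value_bytes
    return overhead, payload, overhead / payload


# Full-size chunks: 4000 bytes of payload each.
full_overhead, full_payload, full_ratio = overhead_summary(NUM_CHUNKS, 4000)

# Tiny chunks: 4 bytes of payload each.
tiny_overhead, tiny_payload, tiny_ratio = overhead_summary(NUM_CHUNKS, 4)

print(f"overhead for all keys: {full_overhead / 1e9:.0f} GB")
print(f"overhead relative to 4000-byte values: {full_ratio:.2%}")
print(f"overhead relative to 4-byte values: {tiny_ratio:.1f}x the payload")
```

Under these assumptions the key overhead is about 150 GB in total, under 4% on top of full-size chunks but many times the payload for 4-byte chunks, which is why combining very small values into one value pays off.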
My plan was to use membase on EC2, small instances (1.7GB ram, 1.3GB usable by membase, 1TB EBS disk each), and add nodes as needed as I add data.
The number of nodes you need depends on how much data you want in memory versus on disk. Most deployments try to keep about 80% of their data in memory (we call this the working set). How much data do you have currently? What is the expected growth? If you tell me more about your deployment and data characteristics, I can tell you how many nodes you will need to begin with.
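For a feel of how the in-memory fraction drives the node count, here is a back-of-the-envelope sketch using the figures from the question (4 TB of data, 1.3 GB of usable RAM and 1 TB of disk per node). It assumes decimal units and ignores replication, per-key overhead, and headroom, so treat the results as lower bounds; the function name is hypothetical.

```python
import math

TOTAL_DATA_BYTES = 4e12           # ~4 TB when fully deployed
USABLE_RAM_PER_NODE = 1.3e9       # 1.3 GB usable per small instance
DISK_PER_NODE = 1e12              # 1 TB EBS volume per node


def nodes_for_memory(total_bytes, in_memory_fraction, ram_per_node):
    """Nodes needed so that the given fraction of the data fits in RAM."""
    return math.ceil(total_bytes * in_memory_fraction / ram_per_node)


# Keeping 80% of the data in memory, as most deployments aim for.
nodes_80 = nodes_for_memory(TOTAL_DATA_BYTES, 0.80, USABLE_RAM_PER_NODE)

# Keeping only 5% in memory, e.g. if the hot set is much smaller.
nodes_5 = nodes_for_memory(TOTAL_DATA_BYTES, 0.05, USABLE_RAM_PER_NODE)

# Nodes needed for disk capacity alone.
nodes_disk = math.ceil(TOTAL_DATA_BYTES / DISK_PER_NODE)

print(nodes_80, nodes_5, nodes_disk)  # prints "2462 154 4"
```

The gap between the disk-only figure and the 80% figure shows why the size of the working set, rather than the total data size, is the number that matters for sizing this cluster.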
3. Has anyone deployed something similar? Any oops/gotchas with the setup outlined above? My assumption was that this will be severely disk (seek) limited, so going for higher CPU or bandwidth EC2 instances would be wasted?
Our list of customers is at http://www.membase.com/customers and some of them run fairly large clusters. Whether the cluster is limited by disk or by CPU/bandwidth will depend on how much of the data is in memory.
Thanks!
Thank you for trying Membase out! Let me know if you have more questions.