Handful of Questions - Hotspots,Emit Atomicity,View Updates,Reduce Results Saving
Hi,
First off I want to say thank you for your work on CouchBase. It looks to be a very powerful product that will save many people from having to re-invent the wheel. I hope CouchBase continues to grow and becomes superior to Amazon and Google's custom implementations. Considering it's Free Software, it's just a matter of time.
I have a handful of questions regarding CouchBase that I've come up with while reading the docs/examples.
I refer to the 'Bank Example' multiple times in my questions. This is the video I am referring to: http://vimeo.com/39735657 - The Bank part is at 31:30
1) Key Hotspots - Does Couchbase do auto-balancing of load for a single key? Key "frontpage" for example would be accessed very frequently by every client (they could cache it, but then synchronization must be done manually gah). I've read the whitepaper on Google Filesystem, and they had/have this problem too when distributing executables/binaries for a distributed process. Their solution: manually increase the redundancy factor for the hotspot key. I found a similar question on the forums: http://www.couchbase.com/forums/thread/load-balancer ; the answer appears to be a no (and even worse, Couchbase doesn't use the redundancy nodes in round-robin distribution). I would like it if the cluster took care of load balancing for key hotspots (high reads, infrequent writes), including invalidating the in-cluster caches at the appropriate times (I believe this would have to be before the write. If it was after, you have stale data risks. This also might mean that the data is unavailable for a brief moment of time (if we try to read from the cache'd nodes)? I'm unsure).
2) View Emit Atomicity - Are 2 emits in a single View guaranteed to be added to the View's output/index atomically before Reduce processes it? In the bank example, if one emit is in (the balance subtraction) and the next is not (the balance addition), the overall balance will not be zero when doing View.Group().Group to verify all the transactions.
3) How, in the client, to know how often a View is updated? For example, should you use polling to verify the balance in the bank example? Should it be a specialized client (single point of failure!) or should any/every client do this (isn't this inefficient? see question #4)?
4) I know View results/indexes are written to the cluster, but what about filters/reduces? Are they generated only when a client requests them (pretty sure this is a yes)? Seeing as clients can request new filter criteria at runtime, does this mean every filter/reduce with different criteria is stored on the cluster? What if we only want it once? I can think of scenarios where it would be smart to keep it around, and also scenarios where you'd only want it once (so storing it would be a waste of space). Maybe TTLs in this case (except TTLs are lazy, so since you never request it again, it is never deleted :-/).
----EDIT, found this answer http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writi... : "Reduce functions have one final trick up their sleeves, and that's the results of the reduction are stored in the index along with the rest of the view information. This means that when accessing the result of a reduce function in your view is only accessing the index content, and therefore is very low impact compared to calculating the values live when the view is accessed."
But it still makes me wonder if there's a way to NOT save the reduce results.
Also I think I found an error in the docs on that page with the answer, in the _count example: the input has James only but in the output there's Adam, James, and John. OH NVM technically it's not an error considering the elipses indicating there are more values... but sort of hard to see them. At first glance it appears the 3 James values are the only ones. Would improve readability to add the Adam/John input values.
5) Are there any better tap usage examples, including filtering of data? Does the filtering happen on the cluster (efficient) or on the client (inefficient)? Are taps the answer to #3? Can taps even be used with Views? The documentation on Taps is quite lacking (probably because it's an internal API :-/). I hope/imagine Taps and their filters can be set up to stream certain changes based on a key prefix (so you can benefit even more from your 'smart' key design), and that this filtering is done on the cluster?
I might be able to answer some of the parts of #5 myself after I get a test server up (running into dependency issues atm but they are documented so I should be able to get around them), for example the first thing I'm going to do is observe the output of tap_mutation, tap.py, and tap_example.py (perhaps you could consider putting sample output in the documentation for future readers). But I can't answer the "where does the filter run? client or cluster?" question without diving into the Couchbase source. In any case, I'm more interested in the answers to #1-4 than #5.
Thanks again for what appears to be an amazing product (haven't tried it yet),
d3fault
Thank you for answering, it has cleared a lot of things up for me.
1) While I certainly agree that a memory based solution is faster than Google's hard disk based solution, I can't help but wonder if there would still be bottlenecks in other places, such as the network link. A gigabit ethernet cable maxes out at ~125 MB/s, so only there could only be ~125 simultaneous users accessing a single 1mb document. Perhaps I'm just getting too obsessed with scalability theory (interesting stuff yo), but I could see that as being a problem in a large cluster/application. Essentially what I'm getting at is that single keys do not scale horizontally. The work-around solution is simple: clients cache the hotspot key. Take this as a distant feature suggestion in making CouchBase a swiss army knife of distributed databases (because it looks like you've got your hands full with the feature-rich 2.0 release).
4) I'm confused on what you mean by: "ad-hoc queries are really only useful for development". I was under the impression they were the solution to querying (based on a user supplied string, for example) the data in a production environment. By development, do you mean when a coder is writing his client interface and is merely checking the results of something he recently did (making sure it worked etc)... or do you mean a dedicated client (not serving users) behind the scenes that is directing the overall flow/organization of data across the cluster (since he is not dealing with users, he does not have latency constraints)? In any case, I meant neither of those when I said I thought querying was the solution for doing a string search (example) in a production environment. My [erroneous?] method would have had the client/client-interface doing it on behalf of the user.
d3fault
I've put more thought into that feature idea, but what it really depends on is this:
Can you _detect_ a key hotspot currently? I'm not comfortable enough with the Couchbase source code to want to try to figure that out.
If you CAN, great. Here's a design elaborating on what I said earlier to scale a single hotspot key horizontally:
First, like I said already, set up another node (or nodes) to do caching. These caching nodes know who the actual owner of the key is. The actual owner and the smart client will have to communicate with each other somehow. The actual owner could lie to the smart client and say that the cache node is where the key resides... or it can be honest and let him know it's just a cache (knowing every cache node + the actual would allow the smart client to do round robin himself. But if he is "lied to", then the actual key owner node will be the one doing the round robin distributing. It doesn't matter, but there are options).
This is easy to understand and only handles reads. Writes are a bit tricker but still possible:
The smart client (or lied-to client, whichever) does a write on the key. Whether he does it to the actual key owner or to the cache (thinking it's the actual owner) depends on the above (but it doesn't matter). If he writes to the cache, the cache just forwards the write to the actual. Once it hits the actual owner, the actual key owner node invalidates every cache that he instantiated, then performs the write. If a read comes in to any of the caches during the period they are invalidated, they would just ask the actual owner for the value.... which is a blocking operation until the actual owner finishes the write. Alternatively, the actual owner could push the new value out to the caches (simultaneously marking them valid again), but this makes no difference.
The idea is that a key hotspot node owner only instantiates more caches as needed..... but theoretically can use every node in the cluster to cache it. Writing becomes more expensive... but for such an extreme case with so many reads, it is worth it.
It doesn't even sound that hard to implement for someone comfortable with the code base... but depends entirely on whether or not you can detect key hotspots. 2.0 release maybe :-D?
d3fault
1) You are correct that if you maxing out the network will limit your scalability. We rarely run into normal customer scenarios where this is the case and in our view if we allow the network to be your bottleneck then we have done a good job building our server. As our clients mature it is likely that we will added some built in caching mechanism in the clients. At present you could certainly use a third party solution along side of our clients.
4) To me one of the reasons for choosing a NoSQL database is speed. To have the fastest possible queries you want to have the indexes defined before query time. Ad-hoc queries are built on demand and are therefore slower so I was just making this point. I think I went a little bit to far to say that they are only useful for development. I think you have a good understanding of what's going on here.
I've put more thought into that feature idea, but what it really depends on is this:
Can you _detect_ a key hotspot currently? I'm not comfortable enough with the Couchbase source code to want to try to figure that out.
If you CAN, great. Here's a design elaborating on what I said earlier to scale a single hotspot key horizontally:
First, like I said already, set up another node (or nodes) to do caching. These caching nodes know who the actual owner of the key is. The actual owner and the smart client will have to communicate with each other somehow. The actual owner could lie to the smart client and say that the cache node is where the key resides... or it can be honest and let him know it's just a cache (knowing every cache node + the actual would allow the smart client to do round robin himself. But if he is "lied to", then the actual key owner node will be the one doing the round robin distributing. It doesn't matter, but there are options).
This is easy to understand and only handles reads. Writes are a bit tricker but still possible:
The smart client (or lied-to client, whichever) does a write on the key. Whether he does it to the actual key owner or to the cache (thinking it's the actual owner) depends on the above (but it doesn't matter). If he writes to the cache, the cache just forwards the write to the actual. Once it hits the actual owner, the actual key owner node invalidates every cache that he instantiated, then performs the write. If a read comes in to any of the caches during the period they are invalidated, they would just the actual owner for the value.... which is a blocking operation until the actual owner finishes the write. Alternatively, the actual owner could push the new value out to the caches (simultaneously marking them valid again), but this makes no difference.
The idea is that a key hotspot node owner only instantiates more caches as needed..... but theoretically can use every node in the cluster to cache it. Writing becomes more expensive... but for such an extreme case with soooooo many reads, it is worth it.
It doesn't even sound that hard to implement for someone comfortable with the code base... but depends entirely on whether or not you can detect key hotspots. 2.0 release maybe :-D?
d3fault
err I keep hitting the spam filter when I try to edit it. Just wanted to add that it applies to 1.8 too :), so could be back-ported easily and released as 1.8.1 or something
d3fault
edit: wtf double post too? sorry about that
edit 2: since adding code to detect key hotspots (keeping count of which keys are accessed and how often etc) would potentially add slowdown (unless it already exists? in which case ignore this), I think it'd be a great compile time switch. #define SCALE_HOTSPOT_KEYS_HORIZONTALLY and then users that don't need that functionality are not affected by it.
Today's free feature day. I give you free features/designs and you give me FREE SOFTWARE :-D. Fair trade eh? Fuck yea Richard Stallman.
edit 3: If by chance one of the replica nodes are instantiated as a cache, it can refer to the same piece of data as a memory saving technique. It would still need to be aware that it is performing 2 roles simultaneously (it invalidates reads when 'actual' says to, but is PART OF the write that it is waiting on (you DO replicate before reporting the write as complete to the client, don't you?)). Perhaps it would be smartest to intantiate the repica nodes as cache nodes first anyways (or maybe smartest to use node with least activity)?
If I understand correctly your idea is to set up one or more caching nodes for heavily read keys and to incorporate these caching nodes into our cluster and only use them for heavily read keys. These caching nodes would be part of the cluster and smart clients would know whether or not to go to them or to go to a normal Couchbase node.
In my opinion we actually already do something like this. In Couchbase each node contains a memcached layer at the highest level. If we have a Couchbase server that has enough memory to hold 3 keys and space for more keys on disk then this server will be able function at over 100k ops/sec if there are only 3 keys since all of our data accesses will come from memory. When we start adding more keys things get a little bit more interesting. If we have more then 3 keys than what is important is that only 3 keys are in use at any given time. So basically you don't want your maximum working set to exceed the amount of memory you have in the cluster. Doing this will rovide sub-millisecond latencies for a key no matter how hot it is.
My point here is that all of you active keys are already cached by us so having another caching server is in my opinion not likely to give you any major benefits, unless of course this caching mechanism is built into your client application.
Let me know if you have any other questions or ideas.
Allow me to elaborate using maths,
Suppose you have a very popular blog, let's say example.com
example.com goes through a load balancer that uses round robin distribution.
Behind the load balancer is... say... 20 webservers. These are the servers that the load balancer hands off to.
Each webserver has it's own internet data link, let's say at 100mbps (12.5 MB/s) each, to serve users of example.com directly.
The webservers are also CLIENTS to a Couchbase cluster on an internal LAN. Let's say 100 couchbase nodes (but this number does not matter, because 1 key is only ever located on 1 node at any given moment (excluding replicas)). The entire internal LAN is hooked up via gigabit links.
If you post a 1mb file (binary, whatever) on the front page of your blog, which is backed by the Couchbase cluster, that 1mb file is going to become a hotspot.
Let's say 1000 people are on example.com when you post the file, and all of them try to download it at the same time. I chose 1000 people because it is a relatively low number these days.
The webservers will be able to keep up: every user gets 250 KB/s download speeds. (20 webservers * 12.5 MB/s each = 250 MB/s overall throughput... divided by 1000 users simultaneously downloading the file = 250 KB/s for each user).
The bottleneck becomes the couchbase node that holds the 1mb file. Not because of memory I/O, but because it maxes out it's gigabit link to the webservers/couchbase-clients.
That single node is responsible for serving every user on every client (1000 people means 1000 requests unless you do additional caching in your webserver/couchbase-client), and it only has a single 1gb link to do it.
A gigabit link is 125 MB/s.
125 MB/s divided by 1000 users downloading the same file = 125 KB/s each. This is half the capacity of your webservers' maximum overall throughput.
20 Webservers (100mb links) can serve 250 kb/s to 1000 users
1 Couchbase node (1gb link) can serve 125 kb/s to 1000 users
See the problem?
I got lucky those numbers came out so cleanly lol... didn't plan for that.
The purpose of instantiating neighbor Couchbase nodes as caches is to utilize the neighbor's 1gb link to the webservers/couchbase-clients (this is assuming the nodes don't share a router or something in the LAN, in which case the router becomes the bottleneck). In this case you'd only need to instantiate 1 'cache' to help you meet your webserver's maximum throughput of 250 KB/s when 1000 users are online downloading a single file ("key" in Couchbase).
I do understand what you're saying about such a caching strategy taking up a portion of your neighbor's RAM (and therefore working set), but I think under such circumstances it is well worth the cost (the size of the document is insignificant... it's the frequent accesses we're trying to remedy).
..................
On Another Note:
I'm starting to wonder if Couchbase is the product I want. It seems perfect, but there appears to be too much emphasis on 'keeping all documents in memory'.
Is it safe to say that Couchbase is a product more similar to Amazon ElastiCache than Amazon DynamoDB?
That being said, wouldn't Couchbase perform and function just fine with a working set 5 times the size of the overall cluster's available RAM?
I am OK with cold cache gets. Keep the most frequently accessed documents in RAM, yea... that's a common sense caching mechanism.
If the answer to this is: "Yes, Couchbase performs perfectly well with a working set much larger than the total available RAM in the cluster", then GREAT (I will probably use it!).... but I have one business suggestion: don't put so much emphasis on making sure the entire working set fits into RAM (unless you're selling the 'sub millisecond latency' bullet point that is so frequently repeated). For most use cases, having the entire working set in memory is not necessary. When it IS necessary, the person setting everything up will know and will size their RAM/Cluster accordingly.
Is Couchbase a:
"Distributed Database"?
or
"In-Memory Cloud Cache" (to quote ElastiCache)?
(Functionally, it is both... but the marketing/documentation makes it sound more like an In-Memory Cloud Cache. You might be scaring off potential customers this way)
I think Couchbase's history with Memcached explains why there's so much emphasis on 'keeping every document in memory'.
Lastly, is there a way to make Couchbase skip the warmup period? Tbh, I'm ok with a cold cache get for the first get of every key. I'm also a bit confused on how Couchbase determines which portion of the working set (assuming it's larger than the node has RAM) it loads. Say I have 2gb on the hard drive but only 1gb RAM for that node... does it just get the first 1gb during the warmup? This isn't an issue, I'm just curious (so long as everything is still accessible (cold or not)).
And now, joke time:
I'm getting... cold feet.
Your history with memcached might be... clouding your vision.
rofl. I'm here till Tuesday.
Keep up the great work guys, I can be a bit nit picky at times I know.
d3fault
1. I understand that you are trying to lower the bandwidth used with certain network connections. This however isn't something we frequently hear about and that can be implemented client side. You could for example just set a cache and place keys in it with a low time to live (maybe 15 seconds) and then just refetch the item from Couchbase every 15 seconds so that the item isn't stale. If we get more requests for this feature then we will certainly consider adding it.
2. We have many customers who don't keep their entire working sets in memory and things work fine. Just realize that you are forcing requests to hit disk and this will slow the average latency for requests.
3. Couchbase is a distributed database. We persist all of your data onto disk.
4. Warmup must take place since we have to load all of the key data into memory so that requests can take place. We then load in all of the values that were in memory prior to shutdown. This second step could potentially be skipped, but I don't think we currently provide the functionality to skip it.
I am thinking about doing it in the client, but I really think the functionality belongs in the cluster itself. Implementing it in the client would have the downside of: replicating and persisting the caches (a cache needs neither). A 15 second TTL would definitely be easier to implement, but also means you can receive stale data for a maximum of 15 seconds. The design I outlined above would have atomic writes. The more caches instantiated, the more expensive... but it would be well worth it for such hypothetical extremes. No client would ever get stale data.
Should I file this as an Improvement in the Issue tracker?
and thank you for helping me regain my confidence in using Couchbase as a distributed database :)
d3fault
Yes, please file an improvement for this.
1) We haven't had any issues with hot keys because of our in memory caching layer. This layer provides sub-millisecond response times and you can do over 100k ops/sec on even a single Couchbase node. Also, we have a replica read api, but I wouldn't suggest using it because I don't think it is necessary in most cases. I invite you to run a bunch of clients and hammer on the same key. I think you will find little to no performance drop.
3) You can provide a parameter to view requests stale=false that will guarentee that the view is up to date before you get the results back. You can also specify stale=update_after to get old view results but kick off the update process you your next query will be up to date.
4) My understanding is that we only store a single copy of the index. The query parameters simply allow you to scan the index to return only certain parts of whats in the index. In the future we will likely add support for ad-hoc queries, but for now we don't have them. Also, ad-hoc queries are really only useful for development. Anything you want to be fast should not be done ad-hoc.
5) Filtering of tap streams must be done client side. In the future we will likely add filters of the server-side since tap streams can be very useful for building applications. Tap streams are mostly used internally but we do plan on making them more public later. Take a look at the Couchbase java client since it has the most mature tap stream interface and is pretty easy to use.
I feel that I might have left a few things out so please let me know if you have any other questions or if there was anything I missed (Aside from question 2 since I don't think I'm the best person to answer that one).