Handful of Questions - Hotspots, Emit Atomicity, View Updates, Reduce Results Saving
First off, I want to say thank you for your work on Couchbase. It looks to be a very powerful product that will save many people from having to re-invent the wheel. I hope Couchbase continues to grow and becomes superior to Amazon's and Google's custom implementations. Considering it's Free Software, it's just a matter of time.
I have a handful of questions about Couchbase that came up while reading the docs and examples.
I refer to the 'Bank Example' multiple times in my questions. This is the video I am referring to: http://vimeo.com/39735657 - The Bank part is at 31:30
1) Key Hotspots - Does Couchbase automatically balance load for a single key? The key "frontpage", for example, would be read very frequently by every client (clients could cache it, but then invalidation must be handled manually, gah). I've read the Google File System whitepaper, and they had/have this problem too when distributing executables/binaries for a distributed process. Their solution: manually increase the replication factor for the hotspot key. I found a similar question on the forums: http://www.couchbase.com/forums/thread/load-balancer ; the answer appears to be no (and even worse, Couchbase doesn't serve reads from the replica nodes in round-robin fashion). I would like the cluster to take care of load balancing for key hotspots (high reads, infrequent writes), including invalidating the in-cluster caches at the appropriate times. I believe the invalidation would have to happen before the write; if it happened after, you risk serving stale data. This might also mean the data is unavailable for a brief moment (if we try to read from the caching nodes)? I'm unsure.
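To make the GFS-style workaround concrete, here is a minimal sketch of manually raising the "redundancy factor" of a hot key: write N copies under derived keys and spread reads across them. FakeClient is an in-memory stand-in, not the real Couchbase SDK; the derived-key naming scheme is my own assumption.

```javascript
// Manual hot-key replication sketch. FakeClient is a toy stand-in for a
// key-value client; it is NOT the Couchbase SDK.
var REPLICAS = 4;

function FakeClient() { this.store = {}; }
FakeClient.prototype.set = function (key, value) { this.store[key] = value; };
FakeClient.prototype.get = function (key) { return this.store[key]; };

// Write the hot value under every derived key ("frontpage:0" .. "frontpage:3"),
// so each copy hashes to a different vBucket and, likely, a different node.
function setHot(client, key, value) {
  for (var i = 0; i < REPLICAS; i++) {
    client.set(key + ':' + i, value);
  }
}

// Read a randomly chosen copy, spreading read load across the copies.
function getHot(client, key) {
  var i = Math.floor(Math.random() * REPLICAS);
  return client.get(key + ':' + i);
}

var c = new FakeClient();
setHot(c, 'frontpage', '<html>front page</html>');
```

Of course this pushes the invalidation problem onto the writer (all N copies must be rewritten together), which is exactly why I'd prefer the cluster handled it.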
2) View Emit Atomicity - Are two emits from a single map function guaranteed to be added to the view's output/index atomically, before a reduce processes them? In the bank example, if one emit is in the index (the balance subtraction) and the other is not yet (the balance addition), the overall balance will not be zero when calling View.Group() to verify all the transactions.
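The two emits at issue, sketched as a bank-style map/reduce. A real Couchbase 2.0 map function has the signature `function (doc, meta)` and calls a global `emit()`; here emit is passed in explicitly so the sketch runs standalone, and the document fields (type, from, to, amount) are made up for illustration.

```javascript
// One transfer document produces two emits: a debit and a credit.
function mapTransfer(doc, emit) {
  if (doc.type === 'transfer') {
    emit(doc.from, -doc.amount); // debit one account...
    emit(doc.to, doc.amount);    // ...credit the other
  }
}

// A _sum-style reduce: if both rows of every transfer always enter the
// index together, the total over all emitted values is exactly zero.
function sumReduce(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Simulate indexing a single transfer document.
var rows = [];
mapTransfer({ type: 'transfer', from: 'alice', to: 'bob', amount: 100 },
            function (key, value) { rows.push({ key: key, value: value }); });
```

If the index could ever contain one of these rows without the other, the zero-sum invariant breaks, which is why the atomicity question matters.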
3) View Update Timing - How does a client know when, or how often, a view has been updated? For example, should you poll to verify the balance in the bank example? Should that be done by a specialized client (single point of failure!) or by any/every client (isn't this inefficient? see question #4)?
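One possible shape for the polling approach: any client queries the balance view until the reduced total settles at zero. `queryView` is a hypothetical stand-in for an SDK view query, not a real API; in Couchbase 2.0 the freshness of a view result is controlled with the `stale` query parameter (stale=false forces the index to update before the response is returned).

```javascript
// Poll a (hypothetical) view query function until the reduce value is zero.
function pollUntilBalanced(queryView, maxAttempts) {
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    var result = queryView('bank', 'balance', { stale: false });
    if (result.rows.length > 0 && result.rows[0].value === 0) {
      return attempt; // number of polls it took to see a consistent total
    }
  }
  return -1; // never converged within maxAttempts
}

// Fake view whose index "catches up" on the third query.
var queries = 0;
function fakeQueryView(design, view, options) {
  queries++;
  return { rows: [{ key: null, value: queries >= 3 ? 0 : 100 }] };
}
```

This works, but it's exactly the every-client-polls inefficiency the question worries about, which is why a push mechanism (see #5 on TAP) would be nicer.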
4) I know view results/indexes are written to the cluster, but what about filters/reduces? Are they generated only when a client requests them (I'm pretty sure the answer is yes)? Since clients can request new filter criteria at runtime, does this mean every filter/reduce result with different criteria is stored on the cluster? What if we only want it once? I can think of scenarios where it would be smart to keep the result around, and also scenarios where you'd only want it once (so storing it would be a waste of space). Maybe TTLs would help here, except TTL expiry is lazy, so if you never request the item again, it is never deleted :-/
---- EDIT: I found this answer at http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writi... : "Reduce functions have one final trick up their sleeves, and that's the results of the reduction are stored in the index along with the rest of the view information. This means that when accessing the result of a reduce function in your view is only accessing the index content, and therefore is very low impact compared to calculating the values live when the view is accessed."
But it still makes me wonder if there's a way NOT to save the reduce results.
Also, I thought I found an error in the docs on that same page, in the _count example: the input shows only James, but the output contains Adam, James, and John. Never mind - technically it's not an error, since the ellipses indicate there are more input values... but they're easy to miss. At first glance it looks like the three James values are the only input. Adding the Adam/John input values would improve readability.
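For what it's worth, the reason reduce output can be stored "along with the rest of the view information" seems to be that partial reductions are kept at the inner nodes of the view's B-tree, and the third argument of a reduce function, `rereduce`, tells it whether it is combining raw emitted values or the outputs of earlier reduce calls. A hand-written _count has to handle both cases; this is a sketch of my understanding, not code from the docs:

```javascript
// A hand-written equivalent of the built-in _count reduce.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    // values are counts already computed at lower levels of the tree; sum them
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // values are raw emitted values; count them
  return values.length;
}
```

So "not saving" the reduce result would mean giving up this incremental tree caching, which may be why there's no obvious switch for it.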
5) Are there any better TAP usage examples, including filtering of data? Does the filtering happen on the cluster (efficient) or on the client (inefficient)? Are TAP streams the answer to #3? Can TAP even be used with views? The documentation on TAP is quite sparse (probably because it's an internal API :-/). I hope/imagine TAP and its filters can be set up to stream only certain changes based on a key prefix (so you benefit even more from 'smart' key design), and that this filtering is done on the cluster?
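Here's the kind of key-prefix filtering I have in mind, done on the client side. The "stream" is just an array of TAP-style mutation events I made up; whether the real TAP protocol can push this predicate down to the cluster is exactly the open question above.

```javascript
// Client-side key-prefix filter over a (simulated) TAP mutation stream.
function filterByPrefix(events, prefix) {
  return events.filter(function (ev) {
    return ev.key.indexOf(prefix) === 0; // keep keys starting with prefix
  });
}

// Fake stream of mutation events, keyed with a 'smart' key design.
var stream = [
  { op: 'tap_mutation', key: 'user:1', value: '{}' },
  { op: 'tap_mutation', key: 'frontpage', value: '<html>...</html>' },
  { op: 'tap_mutation', key: 'user:2', value: '{}' }
];
var userEvents = filterByPrefix(stream, 'user:');
```

If this filter can only run on the client, every consumer still receives (and pays the bandwidth for) the full stream, which is the inefficiency I'm asking about.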
I might be able to answer some parts of #5 myself once I get a test server up (I'm running into dependency issues at the moment, but they're documented, so I should be able to work around them). For example, the first thing I'm going to do is observe the output of tap_mutation, tap.py, and tap_example.py (perhaps you could consider putting sample output in the documentation for future readers). But I can't answer the "where does the filter run, client or cluster?" question without diving into the Couchbase source. In any case, I'm more interested in the answers to #1-4 than #5.
Thanks again for what appears to be an amazing product (I haven't tried it yet),