Scaling index reads (Performance)
Hi,
I'm very excited about couchbase 2.0, and i'm investigating the possibility of getting replacing Cassandra with Membase from our current stack. After having run Cassandra in production for some time, we've seen it's shortcomings firsthand and want to find a better solution.
The new indexing and query features in couchbase are very appealing, however i'm curious as to the performance implications of querying the indexes, and i haven't been able to find any detailed documentation or descriptions about the a) the implementation or b) it's current performance/scaling characteristics.
I've read somewhere that the system uses scatter-gather, without further information which leads me to believe that the implementation works something like this:
- Each server in the cluster maintains an index of the objects residing on that server (built incrementally by underlying couchdb store)
- A query is sent to all servers (either via a smart client library or from inside the server) - the scatter phase -, and the results are then gathered on that one machine and sent back to the client with irrelevant data removed.
So, actual questions:
a) Is that how it works?
b) If yes, does that not imply that we will never be able to scale index reads above the capabilities of the smallest box in the cluster (and also that index reads will use processing on ALL machines in the cluster)
c) if yes, is there a roadmap or simliar for moving to some sort of distributed indexes where index reads don´t go across the entire cluster but exhibit linearly scaling characteristics when adding servers like the underlying key/value system?
Thank you,
Oliver
Hey Jan,
Thanks for the detailed reply (and subreplies, you´re a programmer, right?)
Are there any benchmarks/performance tests available that can give us a sense of the performance limits of the current index query system?
With regards to c), what about distributing each index across vbuckets ( or simliar ), with some sort of overall understanding of what range of index data are in each vbucket?
Again, thanks for the reply!
Best,
Oliver
Hi Oliver,
Thanks for the detailed reply (and subreplies, you´re a programmer, right?)
Yea :)
Are there any benchmarks/performance tests available that can give us a sense of the performance limits of the current index query system?
Not currently, but we are working on them. You can make your own, too :)
With regards to c), what about distributing each index across vbuckets ( or simliar ), with some sort of overall understanding of what range of index data are in each bucket?
We are currently exploring biggerbangforthebuck opportunities, but I'm sure we'll look into this as well at some point ;)
Cheers
Jan
--
Hi Oliver,
thanks for writing. You've got it mostly right. In detail:
a) Currently, yes.
b) Correct, I think we recommend to not under-provision cluster nodes, so you get a good baseline speed there.
b.1) Correct again.
c) Totally, we are currently researching different ways to improve the current systems. We are focussing on making views as simple, fast and elastic (excuse the marketing spiel, I think it fits nicely here :) as you'd expect from us. Areas of improvement include caching results, caching view ranges (does the scatter gather has to ask a node at all and smarter and more efficient per-node data aggregation as well as moving some query-time computations to index-time.
Let us know if you have any other questions.
Cheers
Jan
--