Is it possible to have persistent private (local) buckets within a Couchbase cluster?
Hello there,
My team and I are in the process of selecting a NoSQL solution for our project, and so far Couchbase seems to be the most appropriate for the task. However, we've encountered a problem in our evaluation - namely, we need to have both public and private data in our cluster, where some of the data can exist only on the node that added it.
To better explain what we're trying to do, here's the scenario: We have a very tiny application layer we intend to host on the same server where a Couchbase node would run, instead of separating application and DB servers, to save on hardware costs and reduce some network complexity. In this case Couchbase would serve as a data glue between application layers across the cluster by means of a replicated bucket (for the sake of reference, let's call it public). Thus, each app in the cluster would connect to a local (though not necessarily in the CB system) Couchbase instance and do CRUD operations on the mutually shared and replicated public bucket. So far, so good: it works flawlessly and scales brilliantly (albeit a bit slower with the rebalancing than we expected - it can take a good 5-10 minutes to rebalance an empty 512MB bucket across two nodes in the same LAN).
However, our app needs to keep some of the data locally until certain conditions are met. This data should never be available to the other nodes in the cluster - it should never, in any way, leave the server it arrived at before the aforementioned conditions are met. So, in order to save time and reduce complexity, we thought we could use the same DB code and data handling strategy we have already implemented for the public bucket, but replace the referenced bucket with a private one (let's call it private), and then, when the conditions are met, just migrate portions of the data to the public bucket.
However, with Couchbase, it's not as easy as it initially appeared to be. First we thought that if we turned off replication on a bucket, it wouldn't be available to the other nodes in the cluster. But not only is it available, all the data from it instantly gets stored on the other nodes, so even if we turn off the node where we created the bucket with replication turned off, the data is still available to the rest of the nodes, which kind of defeats the idea of turning off replication, if you ask me. Even memcached buckets are accessible and visible from all other nodes in the same fashion; they just have no persistence, and we need persistence for the private bucket as well.
So, the question is - is there a way to do what we need with Couchbase, i.e. to have certain persistent buckets available and visible only to certain nodes in the cluster? Is it even possible with the way Couchbase was envisioned?
Sure, we can run a separate unattached Couchbase instance in a VM, use an unrelated DB solution for the private data, or store the private data encrypted in the public bucket. But we're trying to build an elegant solution that eats minimal resources and doesn't congest the network with data that needs to exist only locally, not to mention a uniform approach to data I/O regardless of whether the data is local or public. Any help?
Sorry for the long post, but I wanted to be as clear as possible.
Thanks in advance.
Also, to move the data, you should be able to back up data from the "private" bucket on a live cluster and restore it to the "public" bucket.
Backup and restore are online and cluster-wide, and can be run at the bucket level if needed. Restore also has an "add" option, which checks for the existence of each key and inserts the key only if it doesn't already exist.
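To illustrate the "add" behaviour described above, here is a minimal Python sketch. It is not the actual restore tool: plain dictionaries stand in for the buckets, and the function name and sample keys are made up for the example.

```python
def restore_with_add(backup: dict, target: dict) -> list:
    """Copy backed-up items into target, skipping keys that already exist.

    Returns the list of keys that were skipped.
    """
    skipped = []
    for key, value in backup.items():
        if key in target:
            # "add" semantics: never overwrite an existing key
            skipped.append(key)
        else:
            target[key] = value
    return skipped


# Hypothetical data: a backup of the "private" bucket and a live "public" bucket
private_backup = {"order::1": {"status": "ready"}, "order::2": {"status": "ready"}}
public_bucket = {"order::2": {"status": "published"}}

skipped = restore_with_add(private_backup, public_bucket)
print("skipped:", skipped)
print("public bucket keys:", sorted(public_bucket))
```

The point is only that a key already present in the destination is left untouched, so re-running the restore should not clobber newer public data.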
"First we thought that if we turn off the replication on a bucket it won't be available to the other nodes in the cluster, but not only it is available, all the data from it instantly gets stored on the other nodes so even if we turn off a node where we created the bucket with turned off replication - the data is still available to the rest of the nodes, which kind of defeats the idea of turned off replication if you ask me. "
I think you are asking if a bucket can be created only on a set of nodes. Let's say we have 10 nodes in the cluster. You want to create the bucket on only 2 of those nodes.
This is not possible, because we use hash partitioning. This means that every bucket always has 1024 partitions by default, which get spread uniformly across all the servers in the cluster. This is a very powerful concept, because it allows us to have smart clients that know exactly where the data lives based on the partition map. It also distributes load uniformly across the cluster without any hot spots.
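A rough Python sketch of the idea, to make it concrete. This is a simplification under stated assumptions: the CRC32-modulo hash is a stand-in for Couchbase's real key hashing, the round-robin partition map is a stand-in for the map the cluster actually hands to clients, and the node names are hypothetical.

```python
import zlib

NUM_PARTITIONS = 1024  # partitions per bucket, as stated above


def partition_for_key(key: str) -> int:
    """Map a document key to a partition ID (simplified stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def build_partition_map(nodes: list) -> list:
    """Assign partitions to nodes round-robin so each node owns a near-equal share."""
    return [nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)]


def node_for_key(key: str, partition_map: list) -> str:
    """What a smart client does: hash the key, then look up the owning node."""
    return partition_map[partition_for_key(key)]


nodes = [f"node{i}" for i in range(10)]  # hypothetical node names
partition_map = build_partition_map(nodes)

# The owning node depends only on the key, not on which node the app talked to.
print(node_for_key("user::1001", partition_map))
```

Note that nothing in the lookup depends on where the write originated, which is why a bucket cannot be pinned to the node that created the data.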
So the purpose of replication isn't so much isolation of data as it is availability. And creating a bucket across all nodes simply means that only a subset of the data is available on each of the nodes. It's not the same data on every node.
So let's go back to our example. Say you have a 10-node cluster and you create a bucket with 0 replicas. The bucket gets distributed across all 10 nodes. If you have 10,000 documents, then each node will hold roughly 1,000 documents and each of the 1024 partitions roughly 10. So in some sense it is what you call private: the data stored on each node stays there. It does not get replicated or transferred if there are no replicas configured.
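The numbers in this example can be checked with a small Python simulation. Again, this is only a sketch: the CRC32-modulo hash, the round-robin partition assignment, and the `doc::N` keys are assumptions made for illustration, not Couchbase internals.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 1024
NUM_NODES = 10
NUM_DOCS = 10_000

# Expected averages if documents spread evenly
print("avg docs per node:", NUM_DOCS / NUM_NODES)
print("avg docs per partition:", round(NUM_DOCS / NUM_PARTITIONS, 2))

# Simulated placement: hash each key to a partition, then partition to a node
docs_per_node = Counter()
for i in range(NUM_DOCS):
    key = f"doc::{i}"  # hypothetical document keys
    partition = zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS
    node = partition % NUM_NODES  # round-robin partition ownership (assumption)
    docs_per_node[node] += 1

# With 0 replicas, every document is counted on exactly one node.
for node in sorted(docs_per_node):
    print(f"node{node}: {docs_per_node[node]} docs")
```

If the hash spreads keys evenly, each node's count should land close to the 1,000 average, and the per-node counts always sum to the full 10,000 because no document is stored twice.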
Hope this helps.
- D