Store multiple keys on the same node.
Is it possible to group certain keys so they fall on the same vBucket? I have a number of "related" keys that I'd like to keep on the same node.
Does Couchbase have the concept of "partition keys" at all? (I didn't see anything like this.)
If it doesn't (and it isn't planned), does this concept sound like it would be supported long term?
It appears that I could override the hash algorithm using the CouchbaseConnectionFactoryBuilder. The basic idea would be to extend DefaultHashAlgorithm, override "valueOf" and "computeMd5" to parse the key in a particular fashion:
Keys would be composed of 2 parts separated by a "~":
RealKeyValue~PartitionKey
So:
Entity/EntityKey~EntityKey
and
EntityLookup/LookupValue~EntityKey
and
blah/blah~EntityKey
would all be hashed to the same bucket since it would only hash the portion that is after the "~" character.
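To make the idea concrete, here is a minimal sketch of the key parsing and hashing step. It deliberately does not extend DefaultHashAlgorithm or touch the client library; PartitionKeyHash, partitionPart and vbucketFor are made-up names, and the CRC32-based formula and the 1024 vBucket count are assumptions standing in for whatever the client really uses. It also assumes the partition key is whatever follows the last "~".

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of any Couchbase client.
final class PartitionKeyHash {
    static final char SEPARATOR = '~';

    // Returns the text after the last '~', or the whole key if there is none.
    static String partitionPart(String key) {
        int idx = key.lastIndexOf(SEPARATOR);
        return idx < 0 ? key : key.substring(idx + 1);
    }

    // Maps a key to a vBucket number using only its partition part.
    // Assumption: a CRC32-derived hash reduced modulo the vBucket count.
    static int vbucketFor(String key, int numVbuckets) {
        CRC32 crc = new CRC32();
        crc.update(partitionPart(key).getBytes(StandardCharsets.UTF_8));
        return (int) ((crc.getValue() >> 16) & 0x7fff) % numVbuckets;
    }

    public static void main(String[] args) {
        String[] keys = {"Entity/42~42", "EntityLookup/bob~42", "blah/blah~42"};
        for (String key : keys) {
            // All three keys share the partition part "42", so they print the same vBucket.
            System.out.println(key + " -> vBucket " + vbucketFor(key, 1024));
        }
    }
}
```

Keys without a "~" fall back to hashing the whole key, so existing keys would keep being spread around as usual under this scheme.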
Is this something that would be a good strategy long term?
Thanks!
There are 3 main benefits to doing this that I can see and one major drawback.
Benefits:
1) The objects that are correlated are all on the same server, so I have a chance of doing some form of atomic update without the overhead of communicating between machines. This is similar to the Google Bigtable concept in which you can do transactions as long as all the objects are correlated to the same "parent" object (and thus stored at the same node location in the cluster).
2) To reduce the need for this atomic update, I've consolidated many objects into one document. (See below). This allows me to update those objects using CAS in a single atomic operation. However, it is also making my document on the large side. (1 MB right now, more to come). Thus, transferring this object between the nodes in my system will get cumbersome. I'm contemplating writing the web layer such that requests for a given entity in the system are redirected to the node that is the master of that data. Thus, network transmission of the large object is removed.
3) It reduces the risk of a node failure making my entity unavailable. If I need to write to the entity and it deals with N documents that are correlated, then it is possible those objects are spread over N nodes. If any one of those nodes is down, then my entity can't be updated. By putting all the documents on one node, my risk goes down to only that 1 node. (It's the same trade-off as with RAID 0 on a hard disk. You get faster performance by doing it, but expose your data to higher risk, since either drive failing will eliminate all your data.)
The main drawback is that all my documents are on one node so I won't get any parallelism during a write for that object. However, that doesn't feel like a major issue to me.
If I'm thinking about the problem in the wrong manner, then I'd love to be educated on how to solve this particular set of issues in a non-atomic world. I'm very new to this. I'm used to being able to say "OPEN TRANSACTION, WRITE ROW, WRITE ROW, WRITE ROW, COMMIT". :)
More details below:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I'm building a game similar in gameplay style to chess on top of Couchbase. I have a number of entities that all work together to make the game work properly. To get around the lack of transactions, I've put the entire "game state" into a single object that represents the game "right now". Now I can atomically update it using CAS, and the world is good.
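For anyone unfamiliar with the pattern, the CAS update I mean is roughly the read, modify, write-if-unchanged loop sketched here. FakeBucket is a made-up in-memory stand-in so the example runs on its own; it is not the Couchbase client API, and the names, the string-valued game state and the retry limit are all assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

final class CasSketch {
    // A document value together with the version (CAS token) it was read at.
    static final class Versioned {
        final long cas;
        final String value;
        Versioned(long cas, String value) { this.cas = cas; this.value = value; }
    }

    // Hypothetical in-memory stand-in for a bucket.
    static final class FakeBucket {
        private final ConcurrentHashMap<String, Versioned> docs = new ConcurrentHashMap<>();

        void insert(String key, String value) { docs.put(key, new Versioned(1, value)); }

        Versioned get(String key) { return docs.get(key); }

        // Writes newValue only if the stored version still equals expectedCas.
        boolean cas(String key, long expectedCas, String newValue) {
            boolean[] written = {false};
            docs.computeIfPresent(key, (k, cur) -> {
                if (cur.cas != expectedCas) return cur; // someone else updated first
                written[0] = true;
                return new Versioned(cur.cas + 1, newValue);
            });
            return written[0];
        }
    }

    // Re-reads and retries until the write lands or maxRetries is exhausted.
    static String update(FakeBucket bucket, String key, UnaryOperator<String> change, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            Versioned current = bucket.get(key);
            if (current == null) throw new IllegalStateException("no such document: " + key);
            String next = change.apply(current.value);
            if (bucket.cas(key, current.cas, next)) return next;
        }
        throw new IllegalStateException("too many concurrent updates to " + key);
    }
}
```

This only protects a single document, which is exactly why everything ends up packed into one game state object.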
But, I also want to keep other objects that are related to the game state in the membase bucket. For instance, a historical log of what happened in the game. Ideally, this would be a deterministic log so that you could replay the entire game from the starting conditions.
This poses a problem. I don't want to put the history into the game state, because you only need it in a rare circumstance. The game state itself is already getting on the heavy side (around 1 MB) due to putting all the correlating objects into a single document, and I don't want to add a bunch of state that isn't typically needed. This poses the problem of how to update both objects during an update so that the game is consistent. If the two objects are on separate nodes, then I run the risk of not being able to write one of the objects when I could write the other one. (And this only gets worse when dealing with more than 2 objects).
I'm probably thinking of the problem in the wrong manner, but I'm not used to writing code that is working in a non-atomic world.
Let's assume that I can manage to consistently update the multiple objects. Then, we run into the issue of game availability. If I want to update the game, I've exposed myself to the risk that 2 different nodes being down could "make my game unavailable for updates". If I have N objects and they are spread over N nodes, then any of those nodes going down would make my game unavailable. By putting all the game related objects on the same node, my risk goes down since I'm only reliant on a single node being the master for all my game objects.
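A back-of-the-envelope sketch of that availability argument, under the simplifying assumptions that each node is up independently with the same probability and that replicas and failover are ignored. The 0.99 figure and the names are purely illustrative.

```java
final class AvailabilitySketch {
    // Probability that every one of the required nodes is up at once,
    // assuming independent failures and identical per-node uptime.
    static double allUp(double perNodeUptime, int nodes) {
        return Math.pow(perNodeUptime, nodes);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 5}) {
            System.out.printf("%d node(s) required: %.4f%n", n, allUp(0.99, n));
        }
    }
}
```

The more nodes an update has to touch, the lower the chance that all of them are reachable, which is the risk I'm trying to cap at a single node.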
Another way to do this would be to set your own vBucket value on a given message. If you look at the memcached binary protocol, the request header has a two-byte field right after the "data type" field (reserved in the original protocol), and that is where we place the vBucket number after we hash the key. I don't necessarily recommend doing this, though, because the hashing algorithm makes sure your keys are spread evenly around the cluster. What benefit do you see from putting all of your keys in the same vBucket?
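For reference, a rough sketch of where the vBucket number sits in a binary-protocol request. As far as I know, the 24-byte request header carries it in the two bytes that follow the one-byte data type field. The class and method names are made up, and this builds a bare GET request by hand rather than going through any client library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class VbucketHeaderSketch {
    // Builds a binary-protocol GET request with an explicit vBucket id.
    static byte[] getRequest(String key, int vbucketId) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(24 + k.length); // big-endian by default
        buf.put((byte) 0x80);             // byte 0: magic (request)
        buf.put((byte) 0x00);             // byte 1: opcode (GET)
        buf.putShort((short) k.length);   // bytes 2-3: key length
        buf.put((byte) 0);                // byte 4: extras length (none for GET)
        buf.put((byte) 0);                // byte 5: data type
        buf.putShort((short) vbucketId);  // bytes 6-7: vBucket id
        buf.putInt(k.length);             // bytes 8-11: total body length
        buf.putInt(0);                    // bytes 12-15: opaque
        buf.putLong(0L);                  // bytes 16-23: CAS
        buf.put(k);                       // body: the key itself
        return buf.array();
    }
}
```

If the vBucket id sent this way does not match the vBucket the cluster map assigns to that node, the server would be expected to reject the request, so this is more of a protocol illustration than a recommendation.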